Commit a57b5493 by Lianmin Zheng Committed by Tianqi Chen

[AUTOTVM][TOPI] Use tunable templates for GPU (CUDA/OpenCL/ROCm/Mali) (#1638)

parent d87c94d4
......@@ -6,8 +6,35 @@ See results on wiki page https://github.com/dmlc/tvm/wiki/Benchmark
## How to Reproduce
### ARM CPU
We use RPC infrastructure in TVM to make device management easy. So you need to use it for reproducing benchmark results.
To obtain the best performance, we always auto-tune for the specific device and obtain
the parameters for the kernels used. To make our results easy to reproduce, we release
pre-tuned parameters for popular networks on some common devices.
TVM will download the related tuning cache files during compilation.
If you don't have one of the listed devices, you can still run these scripts:
pick the device that is most similar to yours and pass it as the argument.
In general, the performance should still be good.
If you have a customized network or device, it is recommended that you run the tuning yourself.
Please follow the tutorial for
[NVIDIA GPU](https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_cuda.html),
[ARM CPU](https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html),
[Mobile GPU](https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_mobile_gpu.html).
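
Whether the parameters come from TopHub or from your own tuning run, they are applied
through an autotvm dispatch context at compile time. A minimal sketch, roughly following
the autotvm tutorials (the log file name `my_tuning.log` is a placeholder; drop the
`apply_history_best` context to rely on the pre-tuned TopHub parameters):

```python
import nnvm.compiler
import nnvm.testing
import tvm
from tvm import autotvm

# resnet-18 workload from nnvm.testing, as used by the benchmark scripts
net, params = nnvm.testing.resnet.get_workload(num_layers=18, batch_size=1)
input_shape = (1, 3, 224, 224)

# apply a self-tuned log; without this context the TopHub parameters are used automatically
with autotvm.apply_history_best('my_tuning.log'):
    with nnvm.compiler.build_config(opt_level=3):
        graph, lib, params = nnvm.compiler.build(
            net, target=tvm.target.cuda(model='1080ti'),
            shape={'data': input_shape}, params=params)
```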
### NVIDIA GPU
Build TVM with LLVM and CUDA enabled. [Help](https://docs.tvm.ai/install/from_source.html)
```bash
python3 gpu_imagenet_bench.py --model 1080ti
python3 gpu_imagenet_bench.py --model titanx
```
### ARM CPU & Mali GPU
For embedded devices, we use the RPC infrastructure in TVM to make device management easy,
so you need to set it up to reproduce the benchmark results.
0. Build TVM with LLVM enabled. [Help](https://docs.tvm.ai/install/from_source.html)
1. Start an RPC Tracker on the host machine
```bash
......@@ -50,24 +77,22 @@ python3 -m tvm.exec.rpc_tracker
rasp3b 8 8 0
```
4. Run benchmark
We did auto-tuning for Huawei P20/Mate10 Pro, Google Pixel2, Raspberry Pi3 and Firefly-RK3399,
and released pre-tuned parameters in [this repo](https://github.com/uwsaml/tvm-distro).
During compilation, TVM will download these operator parameters automatically.
4. Run benchmark
```bash
python3 arm_cpu_imagenet_bench.py --device rasp3b --rpc-key rasp3b
python3 arm_cpu_imagenet_bench.py --device rk3399 --rpc-key rk3399
python3 arm_cpu_imagenet_bench.py --device pixel2 --rpc-key pixel2
python3 arm_cpu_imagenet_bench.py --device p20pro --rpc-key p20pro
python3 arm_cpu_imagenet_bench.py --device mate10pro --rpc-key mate10pro
```
# ARM CPU
python3 arm_cpu_imagenet_bench.py --model rasp3b --rpc-key rasp3b
python3 arm_cpu_imagenet_bench.py --model rk3399 --rpc-key rk3399
python3 arm_cpu_imagenet_bench.py --model pixel2 --rpc-key pixel2
python3 arm_cpu_imagenet_bench.py --model p20pro --rpc-key p20pro
python3 arm_cpu_imagenet_bench.py --model mate10pro --rpc-key mate10pro
If your device has the same or a similar SoC to one of the above devices, you can reuse these parameters.
For example, if your SoC is similar to that of rasp3b, use
```bash
python3 arm_cpu_imagenet_bench.py --device rasp3b --rpc-key your_custom_key
# Mali GPU
python3 mobile_gpu_imagenet_bench.py --model rk3399 --rpc-key rk3399
```
For other devices, it is recommended that you tune your network yourself to get the best performance.
Please follow this [tutorial](https://docs.tvm.ai/tutorials/autotvm/tune_nnvm_arm.html).
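
As a reference, the `--model` argument of these scripts is mapped to a `-model=` option on the
TVM target, which is also what the pre-tuned parameters are matched against. A minimal sketch of
the equivalent target construction (device names taken from the commands above):

```python
import tvm

# ARM CPU target for a Raspberry Pi 3B class board
cpu_target = tvm.target.arm_cpu(model='rasp3b')

# Mali GPU target for an RK3399 board, with its ARM CPU as the host target
gpu_target = tvm.target.mali(model='rk3399')
host_target = tvm.target.arm_cpu(model='rk3399')
```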
### AMD GPU
Build TVM with LLVM and ROCm enabled. [Help](https://docs.tvm.ai/install/from_source.html)
```bash
python3 gpu_imagenet_bench.py --model gfx900 --target rocm
```
"""Benchmark script for ARM CPU.
"""Benchmark script for ImageNet models on ARM CPU.
see README.md for the usage and results of this script.
"""
import argparse
......@@ -14,13 +14,60 @@ import nnvm.testing
from util import get_network, print_progress
def evaluate_network(network, target, target_host, number):
# connect to remote device
tracker = tvm.rpc.connect_tracker(args.host, args.port)
remote = tracker.request(args.rpc_key)
print_progress(network)
net, params, input_shape, output_shape = get_network(network, batch_size=1)
print_progress("%-20s building..." % network)
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(
net, target=target, target_host=target_host,
shape={'data': input_shape}, params=params, dtype=dtype)
tmp = tempdir()
if 'android' in str(target):
from tvm.contrib import ndk
filename = "%s.so" % network
lib.export_library(tmp.relpath(filename), ndk.create_shared)
else:
filename = "%s.tar" % network
lib.export_library(tmp.relpath(filename))
# upload library and params
print_progress("%-20s uploading..." % network)
ctx = remote.context(str(target), 0)
remote.upload(tmp.relpath(filename))
rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}
rlib = remote.load_module(filename)
module = runtime.create(graph, rlib, ctx)
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input('data', data_tvm)
module.set_input(**rparams)
del rparams
# evaluate
print_progress("%-20s evaluating..." % network)
ftimer = module.module.time_evaluator("run", ctx, number=number, repeat=3)
prof_res = np.array(ftimer().results) * 1000 # multiply 1000 for converting to millisecond
print("%-20s %-19s (%s)" % (network, "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--network", type=str, choices=
['resnet-18', 'resnet-34', 'vgg-16', 'mobilenet', 'squeezenet v1.1', ])
parser.add_argument("--device", type=str, required=True, choices=
['resnet-18', 'resnet-34', 'vgg-16',
'mobilenet', 'mobilenet_v2', 'squeezenet_v1.0', 'squeezenet_v1.1'])
parser.add_argument("--model", type=str, choices=
['rk3399', 'mate10', 'mate10pro', 'p20', 'p20pro',
'pixel2', 'rasp3b', 'pynq'])
'pixel2', 'rasp3b', 'pynq'], default='rk3399',
help="The model of the test device. If your device is not listed in "
"the choices list, pick the most similar one as argument.")
parser.add_argument("--host", type=str, default='localhost')
parser.add_argument("--port", type=int, default=9190)
parser.add_argument("--rpc-key", type=str, required=True)
......@@ -34,47 +81,12 @@ if __name__ == "__main__":
else:
networks = [args.network]
target = tvm.target.arm_cpu(model=args.device)
# connect to remote device
tracker = tvm.rpc.connect_tracker(args.host, args.port)
remote = tracker.request(args.rpc_key)
target = tvm.target.arm_cpu(model=args.model)
target_host = None
print("--------------------------------------------------")
print("%-20s %-20s" % ("Network Name", "Mean Inference Time (std dev)"))
print("--------------------------------------------------")
for network in networks:
print_progress(network)
net, params, input_shape, output_shape = get_network(network, batch_size=1)
print_progress("%-20s building..." % network)
with nnvm.compiler.build_config(opt_level=2, add_pass=['AlterOpLayout']):
graph, lib, params = nnvm.compiler.build(
net, target=target, shape={'data': input_shape}, params=params, dtype=dtype)
tmp = tempdir()
if 'android' in str(target):
from tvm.contrib import ndk
filename = "%s.so" % network
lib.export_library(tmp.relpath(filename), ndk.create_shared)
else:
filename = "%s.tar" % network
lib.export_library(tmp.relpath(filename))
# upload library and params
print_progress("%-20s uploading..." % network)
ctx = remote.context(str(target), 0)
remote.upload(tmp.relpath(filename))
rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}
rlib = remote.load_module(filename)
module = runtime.create(graph, rlib, ctx)
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input('data', data_tvm)
module.set_input(**rparams)
# evaluate
print_progress("%-20s evaluating..." % network)
ftimer = module.module.time_evaluator("run", ctx, number=args.number, repeat=3)
prof_res = np.array(ftimer().results) * 1000 # multiply 1000 for converting to millisecond
print("%-20s %-19s (%s)" % (network, "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))
evaluate_network(network, target, target_host, args.number)
""" Benchmark script for performance on GPUs.
For example, run the file with:
`python gpu_imagenet_bench.py --model=mobilenet --target=cuda`.
For more details about how to set up the inference environment on GPUs,
please refer to NNVM Tutorial: ImageNet Inference on the GPU
"""Benchmark script for ImageNet models on GPU.
see README.md for the usage and results of this script.
"""
import time
import argparse
import numpy as np
import tvm
from tvm.contrib.util import tempdir
import tvm.contrib.graph_runtime as runtime
import nnvm.compiler
import nnvm.testing
from tvm.contrib import util, nvcc
from tvm.contrib import graph_runtime as runtime
@tvm.register_func
def tvm_callback_cuda_compile(code):
ptx = nvcc.compile_cuda(code, target="ptx")
return ptx
from util import get_network
def main():
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--model', type=str, required=True,
choices=['resnet', 'mobilenet'],
help="The model type.")
parser.add_argument('--target', type=str, required=True,
choices=['cuda', 'rocm', 'opencl', 'metal', 'nvptx'],
help="Compilation target.")
parser.add_argument('--opt-level', type=int, default=1, help="Level of optimization.")
parser.add_argument('--num-iter', type=int, default=1000, help="Number of iteration during benchmark.")
parser.add_argument('--repeat', type=int, default=1, help="Number of repeative times.")
parser.add_argument("--network", type=str, choices=
['resnet-18', 'resnet-34', 'resnet-50', 'vgg-16', 'vgg-19',
'inception_v3', 'mobilenet', 'mobilenet_v2', 'densenet-121'])
parser.add_argument("--model", type=str,
choices=['1080ti', 'titanx', 'gfx900'], default='1080ti',
help="The model of the test device. If your device is not listed in "
"the choices list, pick the most similar one as argument.")
parser.add_argument("--number", type=int, default=500)
parser.add_argument("--target", type=str,
choices=['cuda', 'opencl', 'rocm', 'nvptx', 'metal'], default='cuda',
help="The tvm compilation target")
args = parser.parse_args()
opt_level = args.opt_level
num_iter = args.num_iter
ctx = tvm.context(args.target, 0)
batch_size = 1
num_classes = 1000
image_shape = (3, 224, 224)
data_shape = (batch_size,) + image_shape
out_shape = (batch_size, num_classes)
if args.model == 'resnet':
net, params = nnvm.testing.resnet.get_workload(
batch_size=1, image_shape=image_shape)
elif args.model == 'mobilenet':
net, params = nnvm.testing.mobilenet.get_workload(
batch_size=1, image_shape=image_shape)
else:
raise ValueError('no benchmark prepared for {}.'.format(args.model))
dtype = 'float32'
if args.target == "cuda":
unroll = 1400
if args.network is None:
networks = ['resnet-50', 'mobilenet', 'vgg-19', 'inception_v3']
else:
unroll = 128
with nnvm.compiler.build_config(opt_level=opt_level):
with tvm.build_config(auto_unroll_max_step=unroll,
unroll_explicit=(args.target != "cuda")):
graph, lib, params = nnvm.compiler.build(
net, args.target, shape={"data": data_shape}, params=params)
networks = [args.network]
data = np.random.uniform(-1, 1, size=data_shape).astype("float32")
module = runtime.create(graph, lib, ctx)
module.set_input(**params)
module.set_input("data", data)
module.run()
out = module.get_output(0, tvm.nd.empty(out_shape))
out.asnumpy()
target = tvm.target.create('%s -model=%s' % (args.target, args.model))
print('benchmark args: {}'.format(args))
ftimer = module.module.time_evaluator("run", ctx, num_iter)
for i in range(args.repeat):
prof_res = ftimer()
print(prof_res)
# sleep for avoiding device overheat
if i + 1 != args.repeat:
time.sleep(45)
print("--------------------------------------------------")
print("%-20s %-20s" % ("Network Name", "Mean Inference Time (std dev)"))
print("--------------------------------------------------")
for network in networks:
net, params, input_shape, output_shape = get_network(network, batch_size=1)
if __name__ == '__main__':
main()
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(
net, target=target, shape={'data': input_shape}, params=params, dtype=dtype)
# create runtime
ctx = tvm.context(str(target), 0)
module = runtime.create(graph, lib, ctx)
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input('data', data_tvm)
module.set_input(**params)
# evaluate
ftimer = module.module.time_evaluator("run", ctx, number=args.number, repeat=3)
prof_res = np.array(ftimer().results) * 1000 # multiply 1000 for converting to millisecond
print("%-20s %-19s (%s)" % (network, "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))
"""Benchmark script for ImageNet models on mobile GPU.
see README.md for the usage and results of this script.
"""
import argparse
import numpy as np
import tvm
from tvm.contrib.util import tempdir
import tvm.contrib.graph_runtime as runtime
import nnvm.compiler
import nnvm.testing
from util import get_network, print_progress
def evaluate_network(network, target, target_host, number):
# connect to remote device
tracker = tvm.rpc.connect_tracker(args.host, args.port)
remote = tracker.request(args.rpc_key)
print_progress(network)
net, params, input_shape, output_shape = get_network(network, batch_size=1)
print_progress("%-20s building..." % network)
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(
net, target=target, target_host=target_host,
shape={'data': input_shape}, params=params, dtype=dtype)
tmp = tempdir()
if 'android' in str(target) or 'android' in str(target_host):
from tvm.contrib import ndk
filename = "%s.so" % network
lib.export_library(tmp.relpath(filename), ndk.create_shared)
else:
filename = "%s.tar" % network
lib.export_library(tmp.relpath(filename))
# upload library and params
print_progress("%-20s uploading..." % network)
ctx = remote.context(str(target), 0)
remote.upload(tmp.relpath(filename))
rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}
rlib = remote.load_module(filename)
module = runtime.create(graph, rlib, ctx)
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input('data', data_tvm)
module.set_input(**rparams)
del rparams
# evaluate
print_progress("%-20s evaluating..." % network)
ftimer = module.module.time_evaluator("run", ctx, number=number, repeat=3)
prof_res = np.array(ftimer().results) * 1000 # multiply 1000 for converting to millisecond
print("%-20s %-19s (%s)" % (network, "%.2f ms" % np.mean(prof_res), "%.2f ms" % np.std(prof_res)))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--network", type=str, choices=
['resnet-18', 'resnet-34', 'vgg-16',
'mobilenet', 'mobilenet_v2', 'squeezenet_v1.1'])
parser.add_argument("--model", type=str, choices=
['rk3399'], default='rk3399',
help="The model of the test device. If your device is not listed in "
"the choices list, pick the most similar one as argument.")
parser.add_argument("--host", type=str, default='localhost')
parser.add_argument("--port", type=int, default=9190)
parser.add_argument("--rpc-key", type=str, required=True)
parser.add_argument("--number", type=int, default=10)
args = parser.parse_args()
dtype = 'float32'
if args.network is None:
networks = ['squeezenet_v1.1', 'mobilenet', 'resnet-18', 'vgg-16']
else:
networks = [args.network]
target = tvm.target.mali(model=args.model)
target_host = tvm.target.arm_cpu(model=args.model)
print("--------------------------------------------------")
print("%-20s %-20s" % ("Network Name", "Mean Inference Time (std dev)"))
print("--------------------------------------------------")
for network in networks:
evaluate_network(network, target, target_host, args.number)
\ No newline at end of file
......@@ -27,20 +27,25 @@ def get_network(name, batch_size):
input_shape = (batch_size, 3, 224, 224)
output_shape = (batch_size, 1000)
if "resnet" in name:
if name == 'mobilenet':
net, params = nnvm.testing.mobilenet.get_workload(batch_size=batch_size)
elif name == 'mobilenet_v2':
net, params = nnvm.testing.mobilenet_v2.get_workload(batch_size=batch_size)
elif name == 'inception_v3':
input_shape = (1, 3, 299, 299)
net, params = nnvm.testing.inception_v3.get_workload(batch_size=batch_size)
elif "resnet" in name:
n_layer = int(name.split('-')[1])
net, params = nnvm.testing.resnet.get_workload(num_layers=n_layer, batch_size=batch_size)
elif "vgg" in name:
n_layer = int(name.split('-')[1])
net, params = nnvm.testing.vgg.get_workload(num_layers=n_layer, batch_size=batch_size)
elif name == 'mobilenet':
net, params = nnvm.testing.mobilenet.get_workload(batch_size=batch_size)
elif "densenet" in name:
n_layer = int(name.split('-')[1])
net, params = nnvm.testing.densenet.get_workload(num_layers=n_layer, batch_size=batch_size)
elif "squeezenet" in name:
version = name.split("_v")[1]
net, params = nnvm.testing.squeezenet.get_workload(batch_size=batch_size, version=version)
elif name == 'inception_v3':
input_shape = (1, 3, 299, 299)
net, params = nnvm.testing.inception_v3.get_workload(batch_size=batch_size)
elif name == 'custom':
# an example for custom network
from nnvm.testing import utils
......
......@@ -8,6 +8,7 @@ from . import mobilenet_v2
from . import mlp
from . import resnet
from . import vgg
from . import densenet
from . import squeezenet
from . import inception_v3
from . import dcgan
......
"""
DenseNet, load model from gluon model zoo
Reference:
Huang, Gao, et al. "Densely Connected Convolutional Networks." CVPR 2017
"""
from .utils import create_workload
from ..frontend.mxnet import _from_mxnet_impl
def get_workload(batch_size, num_classes=1000, num_layers=121, dtype="float32"):
"""Get benchmark workload for mobilenet
Parameters
----------
batch_size : int
The batch size used in the model
num_classes : int, optional
Number of classes
num_layers : int, optional
The number of layers
dtype : str, optional
The data type
Returns
-------
net : nnvm.Symbol
The computational graph
params : dict of str to NDArray
The parameters.
"""
import mxnet as mx
from mxnet.gluon.model_zoo.vision import get_model
image_shape = (1, 3, 224, 224)
block = get_model('densenet%d' % num_layers, classes=num_classes, pretrained=False)
data = mx.sym.Variable('data')
sym = block(data)
sym = mx.sym.SoftmaxOutput(sym)
net = _from_mxnet_impl(sym, {})
return create_workload(net, batch_size, image_shape[1:], dtype)
......@@ -46,18 +46,16 @@ def residual_unit(data, num_filter, stride, dim_match, name, bottle_neck=True):
Base name of the operators
"""
if bottle_neck:
# the same as https://github.com/facebook/fb.resnet.torch#notes,
# a bit difference with origin paper
bn1 = sym.batch_norm(data=data, epsilon=2e-5, name=name + '_bn1')
act1 = sym.relu(data=bn1, name=name + '_relu1')
conv1 = sym.conv2d(
data=act1, channels=int(num_filter*0.25), kernel_size=(1, 1),
strides=(1, 1), padding=(0, 0), use_bias=False, name=name + '_conv1')
strides=stride, padding=(0, 0), use_bias=False, name=name + '_conv1')
bn2 = sym.batch_norm(data=conv1, epsilon=2e-5, name=name + '_bn2')
act2 = sym.relu(data=bn2, name=name + '_relu2')
conv2 = sym.conv2d(
data=act2, channels=int(num_filter*0.25), kernel_size=(3, 3),
strides=stride, padding=(1, 1), use_bias=False, name=name + '_conv2')
strides=(1, 1), padding=(1, 1), use_bias=False, name=name + '_conv2')
bn3 = sym.batch_norm(data=conv2, epsilon=2e-5, name=name + '_bn3')
act3 = sym.relu(data=bn3, name=name + '_relu3')
conv3 = sym.conv2d(
......
......@@ -46,14 +46,13 @@ def residual_unit(data, num_filter, stride, dim_match, name, bottle_neck=True, b
Workspace used in convolution operator
"""
if bottle_neck:
# the same as https://github.com/facebook/fb.resnet.torch#notes, a bit difference with origin paper
bn1 = mx.sym.BatchNorm(data=data, fix_gamma=False, eps=2e-5, momentum=bn_mom, name=name + '_bn1')
act1 = mx.sym.Activation(data=bn1, act_type='relu', name=name + '_relu1')
conv1 = mx.sym.Convolution(data=act1, num_filter=int(num_filter*0.25), kernel=(1,1), stride=(1,1), pad=(0,0),
conv1 = mx.sym.Convolution(data=act1, num_filter=int(num_filter*0.25), kernel=(1,1), stride=stride, pad=(0,0),
no_bias=True, workspace=workspace, name=name + '_conv1')
bn2 = mx.sym.BatchNorm(data=conv1, fix_gamma=False, eps=2e-5, momentum=bn_mom, name=name + '_bn2')
act2 = mx.sym.Activation(data=bn2, act_type='relu', name=name + '_relu2')
conv2 = mx.sym.Convolution(data=act2, num_filter=int(num_filter*0.25), kernel=(3,3), stride=stride, pad=(1,1),
conv2 = mx.sym.Convolution(data=act2, num_filter=int(num_filter*0.25), kernel=(3,3), stride=(1,1), pad=(1,1),
no_bias=True, workspace=workspace, name=name + '_conv2')
bn3 = mx.sym.BatchNorm(data=conv2, fix_gamma=False, eps=2e-5, momentum=bn_mom, name=name + '_bn3')
act3 = mx.sym.Activation(data=bn3, act_type='relu', name=name + '_relu3')
......
......@@ -164,6 +164,31 @@ def measure_option(builder, runner):
Specify how to build programs
runner: Runner
Specify how to run programs
Examples
--------
# example setting for using local devices
>>> measure_option = autotvm.measure_option(
>>> builder=autotvm.LocalBuilder(), # use all local cpu cores for compilation
>>> runner=autotvm.LocalRunner( # measure them sequentially
>>> number=10,
>>> timeout=5)
>>> )
# example setting for using remote devices
>>> measure_option = autotvm.measure_option(
>>> builder=autotvm.LocalBuilder(), # use all local cpu cores for compilation
>>> runner=autotvm.RPCRunner(
>>> 'rasp3b', 'localhost', 9190, # device key, host and port of the rpc tracker
>>> number=4,
>>> timeout=4) # timeout of a run on the device. RPC request waiting time is excluded.
>>>)
Note
----
To make measurement results accurate, you should pick correct values for the arguments
`number` and `repeat` in Runner(). Using `min_repeat_ms` can dynamically adjust `number`,
so it is recommended. The typical value for an NVIDIA GPU is 100 ms.
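# example setting for an NVIDIA GPU, following the note above
# (`repeat` and `min_repeat_ms` are assumed here to be accepted by the local runner)
>>> measure_option = autotvm.measure_option(
>>> builder=autotvm.LocalBuilder(),
>>> runner=autotvm.LocalRunner(
>>> number=20,
>>> repeat=3,
>>> min_repeat_ms=100, # dynamically raise `number` until each repeat takes ~100 ms
>>> timeout=4)
>>> )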
"""
from .measure_methods import LocalBuilder, LocalRunner
......
......@@ -72,12 +72,15 @@ class LocalBuilder(Builder):
raise ValueError("Invalid build_func" + build_func)
self.build_func = build_func
self.tmp_dir = tempfile.mkdtemp()
self.executor = LocalExecutor(timeout=timeout)
self.tmp_dir = tempfile.mkdtemp()
def build(self, measure_inputs):
results = []
shutil.rmtree(self.tmp_dir)
self.tmp_dir = tempfile.mkdtemp()
for i in range(0, len(measure_inputs), self.n_parallel):
futures = []
for inp in measure_inputs[i:i + self.n_parallel]:
......@@ -95,7 +98,7 @@ class LocalBuilder(Builder):
results.append(MeasureResult((res,), MeasureErrorNo.BUILD_TIMEOUT,
self.timeout, time.time()))
elif res.error is not None:
# instantiation errorD
# instantiation error
if isinstance(res.error, InstantiationError):
results.append(MeasureResult((res.error,),
MeasureErrorNo.INSTANTIATION_ERROR,
......@@ -120,9 +123,6 @@ class LocalBuilder(Builder):
return results
def __del__(self):
shutil.rmtree(self.tmp_dir)
class RPCRunner(Runner):
"""Run generated code on remove devices.
......@@ -519,7 +519,7 @@ def request_remote(device_key, host=None, port=None, priority=1, timeout=60):
return remote
def check_remote(target, device_key, host=None, port=None, priority=2, timeout=10):
def check_remote(target, device_key, host=None, port=None, priority=100, timeout=10):
"""
Check the availability of a remote device
......
......@@ -271,7 +271,7 @@ if __name__ == '__main__':
parser.add_argument("--code", action='store_true')
args = parser.parse_args()
logger.basicConfig(level=logger.INFO)
logging.basicConfig(level=logging.INFO)
if args.mode == 'pick':
args.o = args.o or args.i + ".best.log"
......
......@@ -9,7 +9,8 @@ of typical tasks of interest.
from .task import Task, create, register, template, get_config, args_to_workload
from .space import ConfigSpace, ConfigEntity
from .code_hash import attach_code_hash, attach_code_hash_to_arg
from .dispatcher import DispatchContext, ApplyConfig, ApplyHistoryBest, FallbackContext, dispatcher
from .dispatcher import dispatcher, DispatchContext, ApplyConfig, ApplyHistoryBest, \
FallbackContext, clear_fallback_cache
from .topi_integration import register_topi_compute, register_topi_schedule
from .nnvm_integration import extract_from_graph
from .nnvm_integration import extract_from_graph, extract_from_multiple_graph
......@@ -289,15 +289,20 @@ class FallbackContext(DispatchContext):
self.memory = {}
self.silent = False
# a set to prevent printing duplicated messages
self.messages = set()
def _query_inside(self, target, workload):
key = (str(target), workload)
if key in self.memory:
return self.memory[key]
if not self.silent:
logger.warning(
"Cannot find config for target=%s, workload=%s. A fallback configuration "
"is used, which may bring great performance regression.", target, workload)
msg = "Cannot find config for target=%s, workload=%s. A fallback configuration "\
"is used, which may bring great performance regression." % (target, workload)
if msg not in self.messages:
self.messages.add(msg)
logger.warning(msg)
cfg = FallbackConfigEntity()
# cache this config
......@@ -320,3 +325,23 @@ class FallbackContext(DispatchContext):
del self.memory[key]
DispatchContext.current = FallbackContext()
def clear_fallback_cache(target, workload):
"""Clear fallback cache. Pass the same argument as _query_inside to this function
to clean the cache.
Parameters
----------
target: Target
The current target
workload : Workload
The current workload.
Note
----
This is used in alter_op_layout to clear the stale cache created before calling the topi compute function.
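Examples
--------
# a minimal sketch mirroring the call made in topi's _alter_conv2d_layout
>>> autotvm.task.clear_fallback_cache(tvm.target.current_target(), workload)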
"""
context = DispatchContext.current
while not isinstance(context, FallbackContext):
context = context._old_ctx
context.clear_cache(target, workload)
......@@ -208,7 +208,7 @@ def extract_from_graph(graph, shape, dtype, target, symbols, target_host=None):
----------
graph : Graph
The graph to tune
shape : dict of str to tuple, optional
shape : dict of str to tuple
The input shape to the graph
dtype : str or dict of str to str
The input types to the graph
......@@ -249,6 +249,69 @@ def extract_from_graph(graph, shape, dtype, target, symbols, target_host=None):
logger.disabled = old_state
# create tasks for target
tasks = []
for task_name, args in env.get_tasks():
tasks.append(create(task_name, args,
target=target, target_host=target_host,
template_key='direct'))
return tasks
def extract_from_multiple_graph(graphs, shapes, dtypes, target, symbols, target_host=None):
""" Extract tuning tasks from multiple nnvm graphs.
This function is the multiple graph version of extract_from_graph
Parameters
----------
graphs : List of Graph
The list of graphs to tune
shapes : List of dict of str to tuple
The input shapes of the graphs
dtypes : List of str or dict of str to str
The input types of the graphs
target: tvm.target.Target
The compilation target
symbols : Array of nnvm.symbol
Array of nnvm symbols to be tuned
target_host: tvm.target.Target
The host compilation target
Returns
-------
task: Array of autotvm.task.Task
collected tasks
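Examples
--------
# a hypothetical call with two graphs that share the same tunable symbol;
# graph1/graph2 and their shape dicts are placeholders
>>> tasks = autotvm.task.extract_from_multiple_graph(
>>> [graph1, graph2],
>>> [{'data': (1, 3, 224, 224)}, {'data': (1, 3, 299, 299)}],
>>> ['float32', 'float32'],
>>> target=tvm.target.cuda(), symbols=(nnvm.sym.conv2d,))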
"""
import nnvm.compiler
env = TaskExtractEnv.get()
topi_funcs = []
for sym_name in symbols:
if sym_name in env.symbol2topi:
topi_funcs.extend(env.symbol2topi[sym_name])
else:
warnings.warn("Symbol %s is not tunable, ignored" % sym_name)
# run compiler to collect all TOPI calls during compilation
env.reset(topi_funcs)
# disable logger temporarily
old_state = logger.disabled
logger.disabled = True
# use a "tracing" target to do a fake compile for collecting topi calls
tracing_target = _target.create("llvm -device=tracing")
nnvm.compiler.engine.clear_cache()
for graph, shape, dtype in zip(graphs, shapes, dtypes):
nnvm.compiler.build(graph, target=tracing_target, shape=shape, dtype=dtype)
logger.disabled = old_state
# create tasks for target
tasks = []
for task_name, args in env.get_tasks():
tasks.append(create(task_name, args,
......
......@@ -900,6 +900,7 @@ class ConfigEntity(ConfigSpace):
return "%s,%s,%s,%d" % (str(self._entity_map)[12:-1], self.template_key,
self.code_hash, self.index)
class FallbackConfigEntity(ConfigSpace):
"""The config entity created to support fallback"""
......@@ -926,18 +927,74 @@ class FallbackConfigEntity(ConfigSpace):
Then cfg.fallback_split('tile_0', [-1, 8, 4]) will give you cfg['tile_0'].size = [7, 7, 1]
"""
space = self.space_map[name]
assert isinstance(space, SplitSpace)
assert len(constraints) == space.num_outputs
indices = np.arange(space.num_outputs)
# '-1' means no constraint
constraints = [x if x != -1 else 1e10 for x in constraints]
for entity in reversed(space.entities):
if all([entity.size[i] <= constraints[i] for i in indices]):
self._entity_map[name] = entity
return
entity = self._entity_map[name]
now = space.product
for i in reversed(range(space.num_outputs)):
factors = get_factors(now)
find = len(factors) - 1
for j, f in enumerate(factors):
if f > constraints[i]:
find = j - 1
break
if find >= 0:
entity.size[i] = factors[find]
now //= factors[find]
else:
raise RuntimeError("Cannot find feasible fallback split entity for node: " + name)
def fallback_with_reference_log(self, ref_log):
"""A data driven fallback mechanism.
We use tuned parameters from TopHub as reference data.
For an unseen shape, we find the most similar tuned workload from TopHub and
mimic its parameters.
Parameters
----------
ref_log: List of (MeasureInput, MeasureResult)
The reference log
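Examples
--------
# typical usage inside a template's fallback branch (mirrors the arm_cpu conv2d template)
>>> ref_log = autotvm.tophub.load_reference_log('arm_cpu', 'rk3399', 'conv2d', 'direct')
>>> cfg.fallback_with_reference_log(ref_log)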
"""
knob_names = [x for x in self.space_map.keys() if
isinstance(self.space_map[x], SplitSpace)]
# find best match config in reference data by matching tiling factors
factor_list = []
for knob_name in knob_names:
factor_list.append(get_factors(self.space_map[knob_name].product))
best_match_cfg = None
best_match_score = 0
for inp, _ in ref_log:
match_score = 0
for i, knob_name in enumerate(knob_names):
factors = get_factors(int(np.prod(inp.config[knob_name].size)))
match_score += (float(len(set(factor_list[i]).intersection(factors))) /
len(factor_list[i]))
if match_score > best_match_score:
best_match_score, best_match_cfg = match_score, inp.config
if best_match_cfg is None:
return
# mimic its tiling strategy
for knob_name in knob_names:
constraint = list(best_match_cfg[knob_name].size)
constraint[0] = -1
self.fallback_split(knob_name, constraint)
raise RuntimeError("Cannot find feasible fallback split entity for node: " + name)
# copy other knobs
for knob_name in self.space_map.keys():
if not isinstance(self.space_map[knob_name], SplitSpace):
self._entity_map[knob_name] = best_match_cfg[knob_name]
def __repr__(self):
return "%s,%s,%s" % (str(self._entity_map)[12:-1], self.template_key, self.code_hash)
......@@ -2,7 +2,7 @@
TopHub: Tensor Operator Hub
To get the best performance, we typically need auto-tuning for the specific devices.
TVM releases pre-tuned parameters in TopHub for some common networks and hardware targets.
TVM will download these parameters for you when you create the target for the first time.
TVM will download these parameters for you when you call nnvm.compiler.build_module.
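A rough sketch of how these parameters take effect at compile time (illustration only;
the build is wrapped in a TopHub dispatch context):
    with autotvm.tophub.context(target):
        graph, lib, params = nnvm.compiler.build(...)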
"""
# pylint: disable=invalid-name
......@@ -13,15 +13,21 @@ import sys
from .task import ApplyHistoryBest
from .. import target as _target
from ..contrib.download import download
from .record import load_from_file
# root path to store TopHub files
AUTOTVM_TOPHUB_ROOT_PATH = os.path.join(os.path.expanduser('~'), ".tvm", "tophub")
# the version of each package
PACKAGE_VERSION = {
'vta': "v0.01",
'arm_cpu': "v0.01",
'cuda': "v0.01",
'cuda': "v0.02",
'rocm': "v0.01",
'opencl': "v0.01",
'mali': "v0.01",
'vta': "v0.01",
}
logger = logging.getLogger('autotvm')
......@@ -30,6 +36,9 @@ def _alias(name):
"""convert alias for some packages"""
table = {
'vtacpu': 'vta',
'metal': 'opencl',
'nvptx': 'cuda'
}
return table.get(name, name)
......@@ -60,6 +69,7 @@ def context(target, extra_files=None):
all_packages = list(PACKAGE_VERSION.keys())
for name in possible_names:
name = _alias(name)
if name in all_packages:
check_backend(name)
......@@ -121,3 +131,51 @@ def download_package(package_name):
logger.info("Download pre-tuned parameters package %s", package_name)
download("https://raw.githubusercontent.com/uwsaml/tvm-distro/master/tophub/%s"
% package_name, os.path.join(rootpath, package_name), True, verbose=0)
# global cache for load_reference_log
REFERENCE_LOG_CACHE = {}
def load_reference_log(backend, model, workload_name, template_key):
""" Load reference log from TopHub to support fallback in template.
Template will use these reference logs to choose fallback config.
Parameters
----------
backend: str
The backend name
model: str
The name of the model
workload_name: str
The name of the workload. (The first item in the workload tuple)
template_key: str
The template key
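Examples
--------
# mirrors the fallback lookup used by the cuda conv2d template in this commit
>>> ref_log = load_reference_log('cuda', '1080ti', 'conv2d', 'direct')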
"""
backend = _alias(backend)
version = PACKAGE_VERSION[backend]
package_name = "%s_%s.log" % (backend, version)
filename = os.path.join(AUTOTVM_TOPHUB_ROOT_PATH, package_name)
global REFERENCE_LOG_CACHE
key = (backend, model, workload_name, template_key)
if key not in REFERENCE_LOG_CACHE:
tmp = []
if os.path.isfile(os.path.join(AUTOTVM_TOPHUB_ROOT_PATH, package_name)):
find = False
inp = None
for inp, res in load_from_file(filename):
if model == inp.target.model:
find = True
break
if not find and inp:
model = inp.target.model
for inp, res in load_from_file(filename):
if (model == inp.target.model and inp.task.workload[0] == workload_name and
inp.config.template_key == template_key):
tmp.append((inp, res))
REFERENCE_LOG_CACHE[key] = tmp
return REFERENCE_LOG_CACHE[key]
......@@ -34,6 +34,7 @@ class Tuner(object):
# time to leave
self.ttl = None
self.n_trial = None
self.early_stopping = None
def has_next(self):
"""Whether has next untried config in the space
......@@ -92,6 +93,7 @@ class Tuner(object):
n_parallel = getattr(measure_batch, 'n_parallel', 1)
early_stopping = early_stopping or 1e9
self.n_trial = n_trial
self.early_stopping = early_stopping
old_level = logger.level
......@@ -127,18 +129,18 @@ class Tuner(object):
res, config)
i += len(results)
self.ttl = min(early_stopping + self.best_iter, n_trial) - i
self.update(inputs, results)
for callback in callbacks:
callback(self, inputs, results)
self.ttl = min(early_stopping + self.best_iter, n_trial) - i
if i >= self.best_iter + early_stopping:
logger.debug("Early stopped. Best iter: %d.", self.best_iter)
break
if error_ct > 150:
logging.basicConfig()
logger.warning("Too many errors happen in the tuning. Now is in debug mode")
logger.setLevel(logging.DEBUG)
else:
......
......@@ -53,7 +53,7 @@ class XGBoostCostModel(CostModel):
upper_model: XGBoostCostModel, optional
The upper model used in transfer learning
"""
def __init__(self, task, feature_type, loss_type, num_threads=4, log_interval=25,
def __init__(self, task, feature_type, loss_type, num_threads=None, log_interval=25,
upper_model=None):
super(XGBoostCostModel, self).__init__()
......
......@@ -64,8 +64,8 @@ def download(url, path, overwrite=False, size_compare=False, verbose=1):
progress_size = int(count * block_size)
speed = int(progress_size / (1024 * duration))
percent = min(int(count * block_size * 100 / total_size), 100)
sys.stdout.write("\r...%d%%, %d MB, %d KB/s, %d seconds passed" %
(percent, progress_size / (1024 * 1024), speed, duration))
sys.stdout.write("\r...%d%%, %.2f MB, %d KB/s, %d seconds passed" %
(percent, progress_size / (1024.0 * 1024), speed, duration))
sys.stdout.flush()
if sys.version_info >= (3,):
......
......@@ -105,6 +105,13 @@ class Target(NodeBase):
self._libs = [l.value for l in self.libs_array]
return self._libs
@property
def model(self):
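"""The device model string parsed from the '-model=' target option ('unknown' if the option is absent)."""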
for opt in self.options_array:
if opt.value.startswith('-model='):
return opt.value[7:]
return 'unknown'
def __enter__(self):
_api_internal._EnterTargetScope(self)
return self
......@@ -354,52 +361,60 @@ def generic_func(fdefault):
return fdecorate
def cuda(options=None):
def cuda(model='unknown', options=None):
"""Returns a cuda target.
Parameters
----------
model: str
The model of cuda device (e.g. 1080ti)
options : str or list of str
Additional options
"""
options = _merge_opts([], options)
return _api_internal._TargetCreate("cuda", *options)
opts = _merge_opts(['-model=%s' % model], options)
return _api_internal._TargetCreate("cuda", *opts)
def rocm(options=None):
def rocm(model='unknown', options=None):
"""Returns a ROCM target.
Parameters
----------
model: str
The model of this device
options : str or list of str
Additional options
"""
options = _merge_opts([], options)
return _api_internal._TargetCreate("rocm", *options)
opts = _merge_opts(["-model=%s" % model], options)
return _api_internal._TargetCreate("rocm", *opts)
def mali(options=None):
def mali(model='unknown', options=None):
"""Returns a ARM Mali GPU target.
Parameters
----------
model: str
The model of this device
options : str or list of str
Additional options
"""
opts = ["-device=mali"]
opts = ["-device=mali", '-model=%s' % model]
opts = _merge_opts(opts, options)
return _api_internal._TargetCreate("opencl", *opts)
def intel_graphics(options=None):
def intel_graphics(model='unknown', options=None):
"""Returns an Intel Graphics target.
Parameters
----------
model: str
The model of this device
options : str or list of str
Additional options
"""
opts = ["-device=intel_graphics"]
opts = ["-device=intel_graphics", '-model=%s' % model]
opts = _merge_opts(opts, options)
return _api_internal._TargetCreate("opencl", *opts)
......@@ -436,6 +451,7 @@ def arm_cpu(model='unknown', options=None):
"rasp3b": ["-model=bcm2837", "-target=armv7l-linux-gnueabihf -mattr=+neon"],
"rk3399": ["-model=rk3399", "-target=aarch64-linux-gnu -mattr=+neon"],
"pynq": ["-model=pynq", "-target=armv7a-linux-eabi -mattr=+neon"],
"ultra96": ["-model=ultra96", "-target=aarch64-linux-gnu -mattr=+neon"],
}
pre_defined_opt = trans_table.get(model, ["-model=%s" % model])
......@@ -494,5 +510,4 @@ def current_target(allow_none=True):
------
ValueError if current target is not set.
"""
target_str = _api_internal._GetCurrentTarget(allow_none)
return create(target_str) if target_str is not None else None
return _api_internal._GetCurrentTarget(allow_none)
......@@ -583,8 +583,7 @@ class Canonical::Internal : public IRMutator {
while (i < suma->elem.size() && j < sumb->elem.size()) {
const auto& a = suma->elem[i];
const auto& b = sumb->elem[j];
if (a.value.same_as(b.value)) {
CHECK_EQ(a.level, b.level);
if (a.value.same_as(b.value) && a.level == b.level) {
ComExprEntry e = a;
e.scale = a.scale + b.scale * bscale;
if (e.scale != 0) {
......
......@@ -252,11 +252,8 @@ void OpenCLWorkspace::Init(const std::string& type_key, const std::string& devic
this->platform_name = cl::GetPlatformInfo(platform_id, CL_PLATFORM_NAME);
this->device_type = device_type;
this->devices = devices_matched;
LOG(INFO) << "Initialize OpenCL platform \'" << this->platform_name << '\'';
break;
}
LOG(INFO) << "\'" << cl::GetPlatformInfo(platform_id, CL_PLATFORM_NAME)
<< "\' platform has no OpenCL device: " << device_type << " mode";
}
if (this->platform_id == nullptr) {
LOG(WARNING) << "No OpenCL device";
......@@ -273,9 +270,6 @@ void OpenCLWorkspace::Init(const std::string& type_key, const std::string& devic
this->queues.push_back(
clCreateCommandQueue(this->context, did, 0, &err_code));
OPENCL_CHECK_ERROR(err_code);
LOG(INFO) << type_key << "(" << i
<< ")=\'" << cl::GetDeviceInfo(did, CL_DEVICE_NAME)
<< "\' cl_device_id=" << did;
}
}
......
......@@ -30,14 +30,21 @@ def test_split():
cfg = FallbackConfigEntity()
cfg.define_split('tile_n', cfg.axis(128), num_outputs=3)
cfg.fallback_split('tile_n', [-1, 8, 4])
assert cfg['tile_n'].size == [4, 8, 4]
cfg = FallbackConfigEntity()
cfg.define_split('tile_n', cfg.axis(49), num_outputs=3)
cfg.fallback_split('tile_n', [-1, 8, 4])
assert cfg['tile_n'].size == [7, 7, 1]
cfg = FallbackConfigEntity()
cfg.define_split('tile_n', cfg.axis(49), num_outputs=3)
try:
cfg.fallback_split('tile_n', [-1, 1, 0])
assert False
except RuntimeError:
pass
if __name__ == '__main__':
test_split()
......@@ -61,7 +61,6 @@ def test_make_attrs():
datrr = tvm.load_json(tvm.save_json(dattr))
assert dattr.name.value == "xyz"
def test_make_sum():
A = tvm.placeholder((2, 10), name='A')
k = tvm.reduce_axis((0,10), "k")
......
......@@ -34,20 +34,21 @@ def test_target_dispatch():
with tvm.target.create("metal"):
assert mygeneric(1) == 3
assert tvm.target.current_target() == None
assert tvm.target.current_target() is None
def test_target_string_parse():
target = tvm.target.create("cuda -libs=cublas,cudnn")
target = tvm.target.create("cuda -model=unknown -libs=cublas,cudnn")
assert target.target_name == "cuda"
assert target.options == ['-libs=cublas,cudnn']
assert target.options == ['-model=unknown', '-libs=cublas,cudnn']
assert target.keys == ['cuda', 'gpu']
assert target.libs == ['cublas', 'cudnn']
assert str(target) == str(tvm.target.cuda("-libs=cublas,cudnn"))
assert str(target) == str(tvm.target.cuda(options="-libs=cublas,cudnn"))
assert tvm.target.intel_graphics().device_name == "intel_graphics"
assert tvm.target.mali().device_name == "mali"
assert tvm.target.arm_cpu().device_name == "arm_cpu"
if __name__ == "__main__":
test_target_dispatch()
......
......@@ -42,9 +42,24 @@ def decl_spatial_pack(cfg, data, kernel, strides, padding, layout, out_dtype):
"""spatial packing template"""
return _decl_spatial_pack(cfg, data, kernel, strides, padding, layout, out_dtype, num_tile=2)
@autotvm.task.register_topi_schedule(schedule_conv2d_nchw, 'arm_cpu', ['direct', 'winograd'])
@autotvm.register_topi_schedule(schedule_conv2d_nchw, 'arm_cpu', ['direct', 'winograd'])
def schedule_conv2d_nchw_arm_cpu(cfg, outs):
"""TOPI schedule callback"""
"""TOPI schedule callback for conv2d
Parameters
----------
cfg: ConfigEntity
The config for this template
outs: Array of Tensor
The computation graph description of conv2d
in the format of an array of tensors.
Returns
-------
s: Schedule
The computation schedule for conv2d.
"""
s = tvm.create_schedule([x.op for x in outs])
def _callback(op):
......@@ -120,19 +135,16 @@ def _decl_spatial_pack(cfg, data, kernel, strides, padding, layout, out_dtype, n
cfg.define_annotate("ann_reduce", [kh, kw], policy='try_unroll')
cfg.define_annotate("ann_spatial", [vh, vw, vc], policy='try_unroll_vec')
# ====================================================================
# fallback support
if cfg.is_fallback:
if num_tile == 2:
cfg.fallback_split('tile_co', [-1, 8])
cfg.fallback_split('tile_oh', [-1, 2])
cfg.fallback_split('tile_ow', [-1, 8])
else:
cfg.fallback_split('tile_co', [-1, 16, 4])
cfg.fallback_split('tile_oh', [-1, 1, 1])
cfg.fallback_split('tile_ow', [-1, 1, 4])
cfg['ann_reduce'].anns = ['unroll', 'unroll']
cfg['ann_spatial'].anns = ['none', 'unroll', 'vec']
if num_tile == 2: # arm cpu
ref_log = autotvm.tophub.load_reference_log('arm_cpu', 'rk3399', 'conv2d', 'direct')
cfg.fallback_with_reference_log(ref_log)
elif num_tile == 3: # mali gpu
ref_log = autotvm.tophub.load_reference_log('mali', 'rk3399', 'conv2d', 'direct')
cfg.fallback_with_reference_log(ref_log)
# ====================================================================
VC = cfg["tile_co"].size[-1]
VH = cfg["tile_oh"].size[-1]
......@@ -478,8 +490,8 @@ def decl_winograd_ww(cfg, data, kernel, strides, padding, layout, out_dtype, til
tile_size)
@autotvm.task.register_topi_schedule(schedule_conv2d_winograd_without_weight_transform,
'arm_cpu', ['winograd'])
@autotvm.register_topi_schedule(schedule_conv2d_winograd_without_weight_transform,
'arm_cpu', ['winograd'])
def schedule_conv2d_winograd_without_weight_transform_(cfg, outs):
"""TOPI schedule callback"""
s = tvm.create_schedule([x.op for x in outs])
......@@ -517,11 +529,8 @@ def _alter_conv2d_layout(attrs, inputs, tinfos):
layout, out_dtype)
cfg = autotvm.DispatchContext.current.query(tvm.target.current_target(), workload)
if cfg.is_fallback: # if is fallback, clear query cache and return None
context = autotvm.DispatchContext.current
while not isinstance(context, autotvm.FallbackContext):
context = context._old_ctx
context.clear_cache(tvm.target.current_target(), workload)
if cfg.is_fallback: # if is fallback, clear query cache and return None
autotvm.task.clear_fallback_cache(tvm.target.current_target(), workload)
return None
if cfg.template_key == 'direct': # packing weight tensor
......
......@@ -9,11 +9,11 @@ from ..nn import depthwise_conv2d_nchw
from ..util import traverse_inline
# register original implementation of depthwise_conv2d_nchw since we don't need to change this part
autotvm.task.register_topi_compute(depthwise_conv2d_nchw, ['arm_cpu', 'cpu'], 'direct',
depthwise_conv2d_nchw.fdefault)
autotvm.register_topi_compute(depthwise_conv2d_nchw, ['arm_cpu', 'cpu'], 'direct',
depthwise_conv2d_nchw.fdefault)
# register customized schedule for arm cpu.
@autotvm.task.register_topi_schedule(schedule_depthwise_conv2d_nchw, ['arm_cpu', 'cpu'], 'direct')
@autotvm.register_topi_schedule(schedule_depthwise_conv2d_nchw, ['arm_cpu', 'cpu'], 'direct')
def schedule_depthwise_conv2d_nchw_arm(cfg, outs):
"""Schedule depthwise conv2d
......@@ -37,16 +37,19 @@ def schedule_depthwise_conv2d_nchw_arm(cfg, outs):
A, B, C = data, kernel, output
s[data_pad].compute_inline()
# define tile
##### space definition begin #####
n, c, h, w = s[output].op.axis
cfg.define_split('tile_c', c, num_outputs=2)
cfg.define_split('tile_h', h, num_outputs=2)
cfg.define_split('tile_w', w, num_outputs=2)
_, vc = cfg.define_split('tile_c', c, num_outputs=2)
_, vh = cfg.define_split('tile_h', h, num_outputs=2)
_, vw = cfg.define_split('tile_w', w, num_outputs=2)
cfg.define_annotate('ann', [vh, vw, vc], policy='try_unroll_vec')
# fallback support
if cfg.is_fallback:
cfg.fallback_split('tile_c', [-1, 4])
cfg.fallback_split('tile_h', [-1, 2])
cfg.fallback_split('tile_w', [-1, 4])
ref_log = autotvm.tophub.load_reference_log(
'arm_cpu', 'rk3399', 'depthwise_conv2d_nchw', 'direct')
cfg.fallback_with_reference_log(ref_log)
##### space definition end #####
# pack data to vector form [n, c, h, w] -> [n, C, h, w, VC]
A0 = s.cache_read(data_pad, "global", C)
......@@ -78,7 +81,6 @@ def schedule_depthwise_conv2d_nchw_arm(cfg, outs):
s[A1].compute_at(s[C0], oh)
# try unroll and vectorization
cfg.define_annotate('ann', [ih, iw, vc], policy='try_unroll_vec')
cfg['ann'].apply(s, C0, [ih, iw, vc],
axis_lens=[cfg['tile_h'].size[-1],
cfg['tile_w'].size[-1],
......
......@@ -2,10 +2,8 @@
"""CUDA specific declaration and schedules."""
from __future__ import absolute_import as _abs
from .conv2d import conv2d_cuda
from .conv2d_nchw import schedule_conv2d_nchw
from . import conv2d, depthwise_conv2d, conv2d_transpose_nchw
from .conv2d_hwcn import schedule_conv2d_hwcn
from .depthwise_conv2d import schedule_depthwise_conv2d_nchw, schedule_depthwise_conv2d_nhwc
from .depthwise_conv2d import schedule_depthwise_conv2d_backward_input_nhwc
from .depthwise_conv2d import schedule_depthwise_conv2d_backward_weight_nhwc
from .reduction import schedule_reduce
......@@ -13,7 +11,6 @@ from .softmax import schedule_softmax
from .injective import schedule_injective, schedule_elemwise, schedule_broadcast
from .dense import dense_cuda, schedule_dense
from .pooling import schedule_pool, schedule_global_pool
from .conv2d_transpose_nchw import schedule_conv2d_transpose_nchw
from .extern import schedule_extern
from .nn import schedule_lrn, schedule_l2_normalize
from .vision import *
......
# pylint: disable=invalid-name, no-member, too-many-locals, too-many-statements, too-many-arguments, too-many-branches, line-too-long
# pylint: disable=invalid-name
"""Compute definition for conv2d with cuda backend"""
import tvm
from tvm import autotvm
from tvm.contrib import cudnn
import topi
from ..nn.conv2d import conv2d
from ..util import get_const_int
@conv2d.register("cuda")
def conv2d_cuda(data, kernel, stride, padding, layout='NCHW', out_dtype='float32'):
from .. import nn, generic
from ..util import get_const_int, get_const_tuple, traverse_inline
from .conv2d_direct import schedule_direct_cuda
from .conv2d_winograd import winograd_cuda, schedule_winograd_cuda
@autotvm.register_topi_compute(nn.conv2d, ['cuda', 'gpu'], ['direct', 'winograd'])
def conv2d_cuda(cfg, data, kernel, strides, padding, layout='NCHW', out_dtype='float32'):
"""Conv2D operator for cuda backend.
Parameters
----------
input : tvm.Tensor
cfg: ConfigEntity
The config for this template
data : tvm.Tensor
4-D with shape [batch, in_channel, in_height, in_width]
filter : tvm.Tensor
kernel : tvm.Tensor
4-D with shape [num_filter, in_channel, filter_height, filter_width]
stride : int or a list/tuple of two ints
strides : int or a list/tuple of two ints
stride size, or [stride_height, stride_width]
padding : int or a list/tuple of two ints
......@@ -27,45 +35,56 @@ def conv2d_cuda(data, kernel, stride, padding, layout='NCHW', out_dtype='float32
layout : str
layout of data
out_dtype: str
The output type. This is used for mixed precision.
Returns
-------
output : tvm.Tensor
4-D with shape [batch, out_channel, out_height, out_width]
"""
assert isinstance(stride, int) or len(stride) == 2
if isinstance(stride, int):
stride_h = stride_w = stride
else:
stride_h, stride_w = stride
if isinstance(padding, int):
pad_h = pad_w = padding
else:
pad_h, pad_w = padding
# handle dilation
dilation_h = dilation_w = 1
kernel_tvm = kernel
kernel_cudnn = kernel
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
kernel_before_dilation = kernel.op.input_tensors[0]
kernel_cudnn = kernel_before_dilation
if layout == 'NCHW':
dilation_h = (get_const_int(kernel.shape[2]) + get_const_int(kernel_before_dilation.shape[2]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
dilation_w = (get_const_int(kernel.shape[3]) + get_const_int(kernel_before_dilation.shape[3]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
elif layout == 'NHWC':
dilation_h = (get_const_int(kernel.shape[1]) + get_const_int(kernel_before_dilation.shape[1]) - 1) \
// get_const_int(kernel_before_dilation.shape[1])
dilation_w = (get_const_int(kernel.shape[2]) + get_const_int(kernel_before_dilation.shape[2]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
target = tvm.target.current_target()
if "cudnn" in target.libs:
assert layout != 'HWCN', "HWCN layout not supported with CUDNN."
tensor_format = 0 # CUDNN_TENSOR_NCHW
if layout == 'NHWC':
if layout == 'NCHW':
tensor_format = 0 # CUDNN_TENSOR_NCHW
N, _, H, W = get_const_tuple(data.shape)
elif layout == 'NHWC':
tensor_format = 1 # CUDNN_TENSOR_NHWC
N, H, W, _ = get_const_tuple(data.shape)
else:
raise ValueError("Unsupported layout %s in cudnn" % layout)
CO, CI, KH, KW = get_const_tuple(kernel.shape)
# handle dilation
stride_h, stride_w = (strides, strides) if isinstance(strides, int) else strides
pad_h, pad_w = (padding, padding) if isinstance(padding, int) else padding
OH = (H + 2 * pad_h - KH) // stride_h + 1
OW = (W + 2 * pad_w - KW) // stride_w + 1
cfg.add_flop(2 * N * OH * OW * CO * CI * KH * KW)
dilation_h = dilation_w = 1
kernel_before_dilation = kernel
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
kernel_before_dilation = kernel.op.input_tensors[0]
if layout == 'NCHW':
dilation_h = (get_const_int(kernel.shape[2]) +
get_const_int(kernel_before_dilation.shape[2]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
dilation_w = (get_const_int(kernel.shape[3]) +
get_const_int(kernel_before_dilation.shape[3]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
elif layout == 'NHWC':
dilation_h = (get_const_int(kernel.shape[1]) +
get_const_int(kernel_before_dilation.shape[1]) - 1) \
// get_const_int(kernel_before_dilation.shape[1])
dilation_w = (get_const_int(kernel.shape[2]) +
get_const_int(kernel_before_dilation.shape[2]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
return cudnn.conv2d_forward(data,
kernel_cudnn,
kernel_before_dilation,
stride_h,
stride_w,
pad_h,
......@@ -74,10 +93,51 @@ def conv2d_cuda(data, kernel, stride, padding, layout='NCHW', out_dtype='float32
dilation_w,
conv_mode=1,
tensor_format=tensor_format,
algo=-1) # let CUDNN choose the best algo
elif layout == 'NCHW':
return topi.nn.conv2d_nchw(data, kernel_tvm, stride, padding, out_dtype)
algo=-1) # let CUDNN choose the best algo
if cfg.template_key == 'winograd':
return winograd_cuda(cfg, data, kernel, strides, padding, layout, out_dtype,
pre_computed=False)
if layout == 'NCHW':
return nn.conv2d_nchw(data, kernel, strides, padding, out_dtype)
elif layout == 'HWCN':
return topi.nn.conv2d_hwcn(data, kernel_tvm, stride, padding, out_dtype)
return nn.conv2d_hwcn(data, kernel, strides, padding, out_dtype)
else:
raise ValueError("not support this layout {} yet".format(layout))
@autotvm.register_topi_schedule(generic.schedule_conv2d_nchw, ["cuda", "gpu"],
["direct", 'winograd'])
def schedule_conv2d_nchw_cuda(cfg, outs):
"""TOPI schedule callback of conv2d for cuda gpu
Parameters
----------
cfg: ConfigEntity
The config for this template
outs: Array of Tensor
The computation graph description of conv2d
in the format of an array of tensors.
Returns
-------
s: Schedule
The computation schedule for conv2d.
"""
target = tvm.target.current_target()
if 'cudnn' in target.libs:
return generic.schedule_extern(outs)
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
s = tvm.create_schedule([x.op for x in outs])
def _callback(op):
if op.tag == 'conv2d_nchw':
schedule_direct_cuda(cfg, s, op.output(0))
if op.tag == 'conv2d_nchw_winograd':
schedule_winograd_cuda(cfg, s, op.output(0), pre_computed=False)
traverse_inline(s, outs[0].op, _callback)
return s
# pylint: disable=invalid-name
"""The templates for cuda conv2d operators"""
import tvm
from tvm import autotvm
def schedule_direct_cuda(cfg, s, conv):
"""schedule optimized for batch size = 1"""
##### space definition begin #####
n, f, y, x = s[conv].op.axis
rc, ry, rx = s[conv].op.reduce_axis
cfg.define_split("tile_f", f, num_outputs=4)
cfg.define_split("tile_y", y, num_outputs=4)
cfg.define_split("tile_x", x, num_outputs=4)
cfg.define_split("tile_rc", rc, num_outputs=2)
cfg.define_split("tile_ry", ry, num_outputs=2)
cfg.define_split("tile_rx", rx, num_outputs=2)
cfg.define_knob("auto_unroll_max_step", [0, 512, 1500])
target = tvm.target.current_target()
if target.target_name in ['nvptx', 'rocm']:
cfg.define_knob("unroll_explicit", [1])
else:
cfg.define_knob("unroll_explicit", [0, 1])
# fallback support
if cfg.is_fallback:
ref_log = autotvm.tophub.load_reference_log(
target.target_name, target.model, 'conv2d', 'direct')
cfg.fallback_with_reference_log(ref_log)
##### space definition end #####
pad_data, kernel = s[conv].op.input_tensors
s[pad_data].compute_inline()
if isinstance(kernel.op, tvm.tensor.ComputeOp) and 'dilate' in kernel.op.tag:
s[kernel].compute_inline()
if conv.op in s.outputs:
output = conv
OL = s.cache_write(conv, 'local')
else:
output = s.outputs[0].output(0)
s[conv].set_scope('local')
OL = conv
# create cache stage
AA = s.cache_read(pad_data, 'shared', [OL])
WW = s.cache_read(kernel, 'shared', [OL])
# tile and bind spatial axes
n, f, y, x = s[output].op.axis
kernel_scope, n = s[output].split(n, nparts=1)
bf, vf, tf, fi = cfg["tile_f"].apply(s, output, f)
by, vy, ty, yi = cfg["tile_y"].apply(s, output, y)
bx, vx, tx, xi = cfg["tile_x"].apply(s, output, x)
bf = s[output].fuse(n, bf)
s[output].bind(bf, tvm.thread_axis("blockIdx.z"))
s[output].bind(by, tvm.thread_axis("blockIdx.y"))
s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
s[output].bind(vf, tvm.thread_axis("vthread"))
s[output].bind(vy, tvm.thread_axis("vthread"))
s[output].bind(vx, tvm.thread_axis("vthread"))
s[output].bind(tf, tvm.thread_axis("threadIdx.z"))
s[output].bind(ty, tvm.thread_axis("threadIdx.y"))
s[output].bind(tx, tvm.thread_axis("threadIdx.x"))
s[output].reorder(bf, by, bx, vf, vy, vx, tf, ty, tx, fi, yi, xi)
s[OL].compute_at(s[output], tx)
# tile reduction axes
n, f, y, x = s[OL].op.axis
rc, ry, rx = s[OL].op.reduce_axis
rco, rci = cfg['tile_rc'].apply(s, OL, rc)
ryo, ryi = cfg['tile_ry'].apply(s, OL, ry)
rxo, rxi = cfg['tile_rx'].apply(s, OL, rx)
s[OL].reorder(rco, ryo, rxo, rci, ryi, rxi, n, f, y, x)
s[AA].compute_at(s[OL], rxo)
s[WW].compute_at(s[OL], rxo)
# cooperative fetching
for load in [AA, WW]:
n, f, y, x = s[load].op.axis
fused = s[load].fuse(n, f, y, x)
tz, fused = s[load].split(fused, nparts=cfg["tile_f"].size[2])
ty, fused = s[load].split(fused, nparts=cfg["tile_y"].size[2])
tx, fused = s[load].split(fused, nparts=cfg["tile_x"].size[2])
s[load].bind(tz, tvm.thread_axis("threadIdx.z"))
s[load].bind(ty, tvm.thread_axis("threadIdx.y"))
s[load].bind(tx, tvm.thread_axis("threadIdx.x"))
# unroll
s[output].pragma(kernel_scope, 'auto_unroll_max_step', cfg['auto_unroll_max_step'].val)
s[output].pragma(kernel_scope, 'unroll_explicit', cfg['unroll_explicit'].val)
#pylint: disable=invalid-name, no-member, too-many-locals, too-many-statements, too-many-arguments, too-many-branches, line-too-long
"""Schedule for conv2d_nchw with auto fusion"""
import tvm
import topi
from .. import util
from .. import tag
from .. import generic
def conv2d_224_3_64(s, temp, temp_R, temp_S, Filter_S, Out, Out_L, flag):
"""Schedule conv2d for specific feature_in_out_filter pattern"""
# scheduler params
ofactor = 16
hfactor = 2
if flag >= 96:
hfactor = 4
max_threads = int(tvm.target.current_target(allow_none=False).max_num_threads)
ow_size = util.get_const_int(Out.shape[3])
num_thread = min(max_threads, ow_size * hfactor)
vthread = ofactor
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis((0, num_thread), "threadIdx.x")
thread_xz = tvm.thread_axis((0, vthread), "vthread", name="vx")
i, oc, h, w = s[Out].op.axis
if ow_size * hfactor == num_thread:
ooc, ioc = s[Out].split(oc, factor=vthread)
oh, ih = s[Out].split(h, factor=hfactor)
s[Out].reorder(ooc, oh, ioc, ih, w)
oc = s[Out].fuse(ooc, oh)
ow, _ = s[Out].split(w, nparts=ow_size)
w = s[Out].fuse(ow, ih)
s[Out].bind(w, thread_x)
s[Out].bind(ioc, thread_xz)
s[Out].bind(oc, block_x)
else:
ow, w = s[Out].split(w, factor=num_thread)
s[Out].bind(w, thread_x)
s[Out].bind(ow, block_x)
s[Out_L].compute_at(s[Out], w)
# schedule Out_L local write
i, oc, h, w = s[Out_L].op.axis
ic, dh, dw = s[Out_L].op.reduce_axis
s[Out_L].reorder(i, oc, h, w, ic, dh, dw)
s[temp_S].compute_at(s[Out_L], ic)
s[Filter_S].compute_at(s[Out_L], w)
num_thread1 = max_threads
thread_xx = tvm.thread_axis((0, num_thread1), "threadIdx.x")
block_xx = tvm.thread_axis("blockIdx.x")
i = s[temp].fuse(*s[temp].op.axis)
bx, tx = s[temp].split(i, factor=num_thread1)
s[temp].bind(tx, thread_xx)
s[temp].bind(bx, block_xx)
i = s[temp_R].fuse(*s[temp_R].op.axis)
bx, tx = s[temp_R].split(i, factor=num_thread1)
s[temp_R].bind(tx, thread_xx)
s[temp_R].bind(bx, block_xx)
#schedule temp_S shared mem load
i, ic, h, ow, iw = s[temp_S].op.axis
h = s[temp_S].fuse(h, ow)
_, tx = s[temp_S].split(h, factor=num_thread)
s[temp_S].bind(tx, thread_x)
if num_thread < max_threads:
s[temp_S].vectorize(iw)
#schedule Filter_S shared mem load
i, oc, h, w = s[Filter_S].op.axis
fuse_index = s[Filter_S].fuse(w, h)
w = s[Filter_S].fuse(fuse_index, oc)
tx, _ = s[Filter_S].split(w, nparts=num_thread)
s[Filter_S].bind(tx, thread_x)
def conv2d_56_64_128(s, temp, temp_R, temp_S, Filter_S, Out, Out_L, flag):
"""Schedule conv2d for specific feature_in_out_filter pattern"""
if util.get_const_int(Filter_S.shape[0]) == util.get_const_int(Filter_S.shape[1]):
mark = util.get_const_int(Out.shape[2]) * util.get_const_int(Out.shape[3])
num_thread_x = 0
if mark % 8 == 0 and mark % 7 == 0:
num_thread_x = 8
vthread_x = 7
elif mark % 4 == 0 and mark % 7 == 0:
num_thread_x = 4
vthread_x = 7
else:
for i in range(5, mark):
if mark % i == 0 and num_thread_x == 0:
vthread_x = i
mark = mark // i
if mark % i == 0 and vthread_x > 0:
num_thread_x = i
break
if mark < 5 or num_thread_x * vthread_x > 128:
num_thread_x = 8
vthread_x = 8
num_thread_y = 8
vthread_y = 2
ifactor = 8
block_x = tvm.thread_axis("blockIdx.x")
block_y = tvm.thread_axis("blockIdx.y")
thread_x = tvm.thread_axis((0, num_thread_x), "threadIdx.x")
thread_y = tvm.thread_axis((0, num_thread_y), "threadIdx.y")
thread_xz = tvm.thread_axis((0, vthread_x), "vthread", name="vx")
thread_yz = tvm.thread_axis((0, vthread_y), "vthread", name="vy")
i, oc, h, w = s[Out].op.axis
factor = util.get_const_int(Out.shape[3])
ooc, ioc = s[Out].split(oc, factor=num_thread_y*vthread_y)
oioc, iioc = s[Out].split(ioc, nparts=vthread_y)
s[Out].bind(iioc, thread_y)
s[Out].bind(oioc, thread_yz)
s[Out].bind(ooc, block_y)
if factor < num_thread_x*vthread_x:
oh, ih = s[Out].split(h, factor=num_thread_x*vthread_x//factor)
w = s[Out].fuse(ih, w)
ow, iw = s[Out].split(w, nparts=vthread_x)
s[Out].reorder(i, ooc, oh, oioc, ow, iioc, iw)
s[Out].bind(iw, thread_x)
s[Out].bind(ow, thread_xz)
s[Out].bind(oh, block_x)
s[Out_L].compute_at(s[Out], iw)
else:
ow, iw = s[Out].split(w, factor=num_thread_x)
oh, ih = s[Out].split(h, factor=vthread_x)
s[Out].reorder(i, ooc, oh, ow, oioc, ih, iioc, iw)
oh = s[Out].fuse(oh, ow)
s[Out].bind(iw, thread_x)
s[Out].bind(ih, thread_xz)
s[Out].bind(oh, block_x)
s[Out_L].compute_at(s[Out], iw)
# schedule Out_L local write
i, oc, h, w = s[Out_L].op.axis
ic, dh, dw = s[Out_L].op.reduce_axis
oic, iic = s[Out_L].split(ic, factor=ifactor)
s[Out_L].reorder(oic, dh, dw, iic, h, w)
s[temp_S].compute_at(s[Out_L], oic)
s[Filter_S].compute_at(s[Out_L], dw)
num_thread = tvm.target.current_target(allow_none=False).max_num_threads
thread_xx = tvm.thread_axis((0, num_thread), "threadIdx.x")
block_xx = tvm.thread_axis("blockIdx.x")
i = s[temp].fuse(*s[temp].op.axis)
bx, tx = s[temp].split(i, factor=num_thread)
s[temp].bind(tx, thread_xx)
s[temp].bind(bx, block_xx)
i = s[temp_R].fuse(*s[temp_R].op.axis)
bx, tx = s[temp_R].split(i, factor=num_thread)
s[temp_R].bind(tx, thread_xx)
s[temp_R].bind(bx, block_xx)
#schedule temp_S shared mem load
i, oic, h, w, iic = s[temp_S].op.axis
oic = s[temp_S].fuse(oic, h, w)
ooic, ioic = s[temp_S].split(oic, factor=num_thread_x)
_, iooic = s[temp_S].split(ooic, factor=num_thread_y)
s[temp_S].bind(ioic, thread_x)
s[temp_S].bind(iooic, thread_y)
s[temp_S].vectorize(iic)
i, oc, h, w = s[Filter_S].op.axis
_, ioc = s[Filter_S].split(oc, factor=num_thread_y)
_, ii = s[Filter_S].split(i, factor=num_thread_x)
s[Filter_S].bind(ioc, thread_y)
s[Filter_S].bind(ii, thread_x)
else:
# scheduler params
vthread = 2
opart2 = 4
ofactor = 64
wfactor = 28
ifactor = 8
if flag > 256:
wfactor = 14
num_thread_x = max(1, ofactor//(opart2*2))
num_thread_y = max(1, (wfactor + vthread-1) // vthread)
block_x = tvm.thread_axis("blockIdx.x")
block_y = tvm.thread_axis("blockIdx.y")
block_z = tvm.thread_axis("blockIdx.z")
thread_x = tvm.thread_axis((0, num_thread_x), "threadIdx.x")
thread_y = tvm.thread_axis((0, num_thread_y), "threadIdx.y")
thread_xz = tvm.thread_axis((0, vthread), "vthread", name="vx")
thread_yz = tvm.thread_axis((0, vthread), "vthread", name="vy")
i, oc, h, w = s[Out].op.axis
ooc, ioc = s[Out].split(oc, factor=ofactor)
ow, iw = s[Out].split(w, factor=wfactor)
ow = s[Out].fuse(ow, h)
oioc, iioc = s[Out].split(ioc, nparts=vthread)
oiw, iiw = s[Out].split(iw, nparts=vthread)
oiioc, iiioc = s[Out].split(iioc, nparts=opart2)
s[Out].reorder(i, ooc, ow, oioc, oiw, oiioc, iiw, iiioc)
s[Out].bind(iiioc, thread_x)
s[Out].bind(iiw, thread_y)
s[Out].bind(oiioc, thread_xz)
s[Out].bind(oiw, thread_yz)
s[Out].bind(oioc, block_x)
s[Out].bind(ow, block_y)
s[Out].bind(ooc, block_z)
s[Out_L].compute_at(s[Out], iiioc)
# schedule Out_L local write
i, oc, h, w = s[Out_L].op.axis
ic, dh, dw = s[Out_L].op.reduce_axis
oic, iic = s[Out_L].split(ic, factor=ifactor)
s[Out_L].reorder(oic, dh, dw, iic, h, w)
max_num_thread = tvm.target.current_target(allow_none=False).max_num_threads
if util.get_const_int(Filter_S.shape[1]) == 128:
oic = s[Out_L].fuse(dh, oic)
s[temp_S].compute_at(s[Out_L], oic)
s[Filter_S].compute_at(s[Out_L], oic)
num_thread = max_num_thread
else:
s[temp_S].compute_at(s[Out_L], oic)
s[Filter_S].compute_at(s[Out_L], dw)
num_thread = 456
if max_num_thread < num_thread:
num_thread = max_num_thread
thread_xx = tvm.thread_axis((0, num_thread), "threadIdx.x")
block_xx = tvm.thread_axis("blockIdx.x")
i = s[temp].fuse(*s[temp].op.axis)
bx, tx = s[temp].split(i, factor=num_thread)
s[temp].bind(tx, thread_xx)
s[temp].bind(bx, block_xx)
i = s[temp_R].fuse(*s[temp_R].op.axis)
bx, tx = s[temp_R].split(i, factor=num_thread)
s[temp_R].bind(tx, thread_xx)
s[temp_R].bind(bx, block_xx)
#schedule temp_S shared mem load
i, oic, h, w, iic = s[temp_S].op.axis
oic = s[temp_S].fuse(oic, h, w)
ooic, ioic = s[temp_S].split(oic, factor=num_thread_x)
_, iooic = s[temp_S].split(ooic, factor=num_thread_y)
s[temp_S].bind(ioic, thread_x)
s[temp_S].bind(iooic, thread_y)
s[temp_S].vectorize(iic)
#schedule Filter_S shared mem load
i, oc, h, w = s[Filter_S].op.axis
_, ioc = s[Filter_S].split(oc, factor=num_thread_x)
_, ii = s[Filter_S].split(i, factor=num_thread_y)
s[Filter_S].bind(ioc, thread_x)
s[Filter_S].bind(ii, thread_y)
def conv2d_14_256_256(s, temp, temp_R, temp_S, Filter, Filter_S, Out, Out_L):
"""Schedule conv2d for specific feature_in_out_filter pattern"""
max_threads = int(tvm.target.current_target(allow_none=False).max_num_threads)
if util.get_const_int(Filter.shape[0]) + util.get_const_int(Filter.shape[1]) <= 768:
# scheduler params
vthread_x = util.get_const_int(Out.shape[3])
num_thread_x = 64
ofactor = 8
if util.get_const_int(Filter.shape[3]) == 1 and vthread_x * 5 <= max_threads:
ofactor = 64
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis((0, num_thread_x), "threadIdx.x")
thread_xz = tvm.thread_axis((0, vthread_x), "vthread", name="vx")
i, oc, h, w = s[Out].op.axis
ooc, ioc = s[Out].split(oc, factor=num_thread_x)
s[Out].reorder(i, ooc, h, w, ioc)
ooc = s[Out].fuse(h, ooc)
s[Out].bind(ioc, thread_x)
s[Out].bind(w, thread_xz)
s[Out].bind(ooc, block_x)
s[Out_L].compute_at(s[Out], ioc)
# schedule Out_L local write
i, oc, h, w = s[Out_L].op.axis
ic, dh, dw = s[Out_L].op.reduce_axis
oic, iic = s[Out_L].split(ic, ofactor)
s[Out_L].reorder(oic, dh, dw, iic, h, w)
s[temp_S].compute_at(s[Out_L], oic)
s[Filter_S].compute_at(s[Out_L], oic)
#schedule temp_S shared mem load
i, ic, h, w = s[temp_S].op.axis
s[temp_S].reorder(i, ic, w, h)
ic = s[temp_S].fuse(w, ic)
_, iic = s[temp_S].split(ic, factor=num_thread_x)
s[temp_S].bind(iic, thread_x)
#schedule Filter_S shared mem load
i, oc, h, w = s[Filter_S].op.axis
_, ii = s[Filter_S].split(i, factor=num_thread_x)
s[Filter_S].bind(ii, thread_x)
s[Filter_S].storage_align(s[Filter_S].op.axis[0], 2, 1)
else:
# scheduler params
vthread_x = min(8, util.get_const_int(Out.shape[2]))
num_thread_x = 16
num_thread_y = min(max_threads // num_thread_x, util.get_const_int(Out.shape[3]))
ofactor = 8
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis((0, num_thread_x), "threadIdx.x")
thread_y = tvm.thread_axis((0, num_thread_y), "threadIdx.y")
thread_xz = tvm.thread_axis((0, vthread_x), "vthread", name="vx")
i, oc, h, w = s[Out].op.axis
ow, iw = s[Out].split(w, factor=num_thread_y)
oh, ih = s[Out].split(h, factor=vthread_x)
ooc, ioc = s[Out].split(oc, factor=num_thread_x)
s[Out].reorder(i, ooc, oh, ih, ow, iw, ioc)
s[Out].bind(ioc, thread_x)
s[Out].bind(iw, thread_y)
s[Out].bind(ih, thread_xz)
s[Out].bind(ooc, block_x)
s[Out_L].compute_at(s[Out], ioc)
# schedule Out_L local write
i, oc, h, w = s[Out_L].op.axis
ic, dh, dw = s[Out_L].op.reduce_axis
oic, iic = s[Out_L].split(ic, ofactor)
s[Out_L].reorder(oic, dh, dw, iic, h, w)
s[temp_S].compute_at(s[Out_L], oic)
s[Filter_S].compute_at(s[Out_L], oic)
num_thread = max_threads
thread_xx = tvm.thread_axis((0, num_thread), "threadIdx.x")
block_xx = tvm.thread_axis("blockIdx.x")
i = s[temp].fuse(*s[temp].op.axis)
bx, tx = s[temp].split(i, factor=num_thread)
s[temp].bind(tx, thread_xx)
s[temp].bind(bx, block_xx)
i = s[temp_R].fuse(*s[temp_R].op.axis)
bx, tx = s[temp_R].split(i, factor=num_thread)
s[temp_R].bind(tx, thread_xx)
s[temp_R].bind(bx, block_xx)
#schedule temp_S shared mem load
i, h, w, oc, ic = s[temp_S].op.axis
icc = s[temp_S].fuse(oc, w, h)
oic, iic = s[temp_S].split(icc, factor=num_thread_x)
_, ioic = s[temp_S].split(oic, factor=num_thread_y)
s[temp_S].bind(iic, thread_x)
s[temp_S].bind(ioic, thread_y)
s[temp_S].vectorize(ic)
#schedule Filter_S shared mem load
i, oc, h, w = s[Filter_S].op.axis
_, ii = s[Filter_S].split(i, factor=num_thread_x)
h = s[Filter_S].fuse(h, w)
_, ih = s[Filter_S].split(h, factor=num_thread_y)
s[Filter_S].bind(ii, thread_x)
s[Filter_S].bind(ih, thread_y)
s[Filter_S].storage_align(s[Filter_S].op.axis[0], 2, 1)
def conv2d_56_64_64(s, Filter, temp_S, Filter_S, Out, Out_L):
"""Schedule conv2d for specific feature_in_out_filter pattern"""
# scheduler params
num_thread = 8
vthread = 2
opart2 = 4
ofactor = 64
wfactor = 56
ifactor = 8
if util.get_const_int(Filter.shape[0]) == 64:
opart2 = 8
ifactor = 16
if util.get_const_int(Out.shape[2]) == 224:
num_thread = 4
wfactor = 112
ifactor = 4
sfactor = max(1, ofactor // (opart2*vthread))
spart = max(1, (wfactor + vthread-1) // vthread)
block_x = tvm.thread_axis("blockIdx.x")
block_y = tvm.thread_axis("blockIdx.y")
block_z = tvm.thread_axis("blockIdx.z")
thread_x = tvm.thread_axis((0, num_thread), "threadIdx.x")
thread_y = tvm.thread_axis((0, wfactor // vthread), "threadIdx.y")
thread_xz = tvm.thread_axis((0, opart2), "vthread", name="vx")
thread_yz = tvm.thread_axis((0, vthread), "vthread", name="vy")
i, oc, h, w = s[Out].op.axis
ooc, ioc = s[Out].split(oc, factor=ofactor)
ow, iw = s[Out].split(w, factor=wfactor)
ow = s[Out].fuse(ow, h)
oioc, iioc = s[Out].split(ioc, nparts=vthread)
oiw, iiw = s[Out].split(iw, nparts=vthread)
oiioc, iiioc = s[Out].split(iioc, nparts=opart2)
s[Out].reorder(i, ooc, ow, oioc, oiw, oiioc, iiw, iiioc)
s[Out].bind(iiioc, thread_x)
s[Out].bind(iiw, thread_y)
s[Out].bind(oiioc, thread_xz)
s[Out].bind(oiw, thread_yz)
s[Out].bind(oioc, block_x)
s[Out].bind(ow, block_y)
s[Out].bind(ooc, block_z)
s[Out_L].compute_at(s[Out], iiioc)
# schedule Out_L local write
i, oc, h, w = s[Out_L].op.axis
ic, dh, dw = s[Out_L].op.reduce_axis
oic, iic = s[Out_L].split(ic, factor=ifactor)
s[Out_L].reorder(oic, dh, dw, iic, h, w)
fuse_index = s[Out_L].fuse(dw, dh)
fuse_index = s[Out_L].fuse(fuse_index, oic)
dw = fuse_index
s[temp_S].compute_at(s[Out_L], dw)
s[Filter_S].compute_at(s[Out_L], dw)
#schedule temp_S shared mem load
i, ic, h, w = s[temp_S].op.axis
_, iic = s[temp_S].split(ic, factor=sfactor)
_, iw = s[temp_S].split(w, factor=spart)
s[temp_S].bind(iic, thread_x)
s[temp_S].bind(iw, thread_y)
#schedule Filter_S shared mem load
i, oc, h, w = s[Filter_S].op.axis
_, ioc = s[Filter_S].split(oc, factor=sfactor)
_, ii = s[Filter_S].split(i, factor=spart)
s[Filter_S].bind(ioc, thread_x)
s[Filter_S].bind(ii, thread_y)
def schedule_conv2d_small_batch(outs):
"""Create schedule for tensors or return error if batch size is larger than 1"""
s = tvm.create_schedule([x.op for x in outs])
def schedule(temp, Filter, Output):
"""Schedule conv2d_nchw"""
flag = util.get_const_int(Filter.shape[0])+util.get_const_int(Filter.shape[1])
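# 'flag' is a coarse signature of the layer: the sum of the kernel's output
# and input channel counts. The branches below, and the dispatch further down,
# use it to pick one of the four hand-written schedules for common layer
# shapes (see the conv2d_* helpers above).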
if flag > 768:
temp_G = s.cache_read(temp, "global", [Output])
s[temp_G].compute_inline()
i, ic, h, w = s[temp_G].op.axis
oic, iic = s[temp_G].split(ic, factor=4)
s[temp_G].reorder(i, h, w, oic, iic)
temp_R = s.cache_write(temp_G, "global")
temp_S = s.cache_read(temp_R, "shared", [temp_G])
elif 128 < flag < 512:
temp_G = s.cache_read(temp, "global", [Output])
s[temp_G].compute_inline()
i, ic, h, w = s[temp_G].op.axis
oic, iic = s[temp_G].split(ic, factor=4)
s[temp_G].reorder(i, oic, h, w, iic)
temp_R = s.cache_write(temp_G, "global")
temp_S = s.cache_read(temp_R, "shared", [temp_G])
elif util.get_const_int(Filter.shape[3]) == 7 or (util.get_const_int(Output.shape[2]) == 224 and flag < 128):
temp_G = s.cache_read(temp, "global", [Output])
s[temp_G].compute_inline()
i, ic, h, w = s[temp_G].op.axis
s[temp_G].split(w, factor=4)
temp_R = s.cache_write(temp_G, "global")
temp_S = s.cache_read(temp_R, "shared", [temp_G])
else:
s[temp].compute_inline()
temp_S = s.cache_read(temp, "shared", [Output])
temp_R = temp_S
Filter_S = s.cache_read(Filter, "shared", [Output])
if Output.op in s.outputs:
Out = Output
Out_L = s.cache_write(Out, "local")
else:
Out = outs[0].op.output(0)
s[Output].set_scope("local")
Out_L = Output
if util.get_const_int(Filter.shape[3]) == 7 or (util.get_const_int(Output.shape[2]) == 224 and flag < 128):
conv2d_224_3_64(s, temp, temp_R, temp_S, Filter_S, Out, Out_L, flag)
elif 128 < flag < 512:
conv2d_56_64_128(s, temp, temp_R, temp_S, Filter_S, Out, Out_L, flag)
elif flag >= 512:
conv2d_14_256_256(s, temp, temp_R, temp_S, Filter, Filter_S, Out, Out_L)
else:
conv2d_56_64_64(s, Filter, temp_S, Filter_S, Out, Out_L)
scheduled_ops = []
def traverse(OP):
"""Traverse operators from computation graph"""
# inline all one-to-one-mapping operators except the last stage (output)
if tag.is_broadcast(OP.tag):
if OP not in s.outputs:
s[OP].compute_inline()
for tensor in OP.input_tensors:
if tensor.op.input_tensors and tensor.op not in scheduled_ops:
traverse(tensor.op)
# schedule conv2d
if 'conv2d_nchw' in OP.tag:
temp = OP.input_tensors[0]
Filter = OP.input_tensors[1]
if isinstance(Filter.op, tvm.tensor.ComputeOp) and 'dilate' in Filter.op.tag:
s[Filter].compute_inline()
Output = OP.output(0)
schedule(temp, Filter, Output)
scheduled_ops.append(OP)
traverse(outs[0].op)
return s
@generic.schedule_conv2d_nchw.register(["cuda", "gpu"])
def schedule_conv2d_nchw(outs):
"""Schedule for conv2d_nchw.
Parameters
----------
outs: Array of Tensor
The computation graph description of conv2d_nchw
in the format of an array of tensors.
Returns
-------
s: Schedule
The computation schedule for conv2d_nchw.
"""
target = tvm.target.current_target()
if target.target_name == "cuda" and "cudnn" in target.libs:
return topi.generic.schedule_extern(outs)
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
batch_size = util.get_const_int(outs[0].op.output(0).shape[0])
if batch_size > 1:
raise RuntimeError("Batch size: %d is too large for this schedule" % batch_size)
return schedule_conv2d_small_batch(outs)
#pylint: disable=invalid-name, line-too-long
"""Schedule for conv2d_transpose_nchw with auto fusion"""
import tvm
from .. import util
from .. import tag
from .. import generic
from .conv2d_nchw import conv2d_224_3_64, conv2d_56_64_128, conv2d_14_256_256, conv2d_56_64_64
# pylint: disable=invalid-name
"""Conv2d transpose template for cuda backend"""
import tvm
from tvm import autotvm
def schedule_conv2d_transpose_small_batch(outs):
"""Create schedule for tensors or return error if batch size is larger than 1"""
s = tvm.create_schedule([x.op for x in outs])
from .. import nn, generic
from ..util import equal_const_int, get_const_tuple, traverse_inline
def schedule(temp, Filter, Output):
"""Schedule conv2d_transpose_nchw"""
block_h = util.get_const_int(Output.shape[3])
block_w = util.get_const_int(temp.shape[1])
if block_h % 48 == 0:
block_h = 48
elif block_h % 32 == 0:
block_h = 32
if block_w % 48 == 0:
block_w = 48
elif block_w % 32 == 0:
block_w = 32
flag = util.get_const_int(Filter.shape[0])+util.get_const_int(Filter.shape[1])
if flag > 768:
temp_G = s.cache_read(temp, "global", [Output])
s[temp_G].compute_inline()
i, ic, h, w = s[temp_G].op.axis
oic, iic = s[temp_G].split(ic, factor=4)
s[temp_G].reorder(i, h, w, oic, iic)
temp_R = s.cache_write(temp_G, "global")
temp_S = s.cache_read(temp_R, "shared", [temp_G])
elif 128 < flag < 512:
temp_G = s.cache_read(temp, "global", [Output])
s[temp_G].compute_inline()
i, ic, h, w = s[temp_G].op.axis
oic, iic = s[temp_G].split(ic, factor=4)
s[temp_G].reorder(i, oic, h, w, iic)
temp_R = s.cache_write(temp_G, "global")
temp_S = s.cache_read(temp_R, "shared", [temp_G])
elif util.get_const_int(Filter.shape[3]) == 7 or (util.get_const_int(Output.shape[2]) == 224 and flag < 128):
temp_G = s.cache_read(temp, "global", [Output])
s[temp_G].compute_inline()
i, ic, h, w = s[temp_G].op.axis
s[temp_G].split(w, factor=4)
temp_R = s.cache_write(temp_G, "global")
temp_S = s.cache_read(temp_R, "shared", [temp_G])
else:
s[temp].compute_inline()
temp_S = s.cache_read(temp, "shared", [Output])
temp_R = temp_S
Filter_S = s.cache_read(Filter, "shared", [Output])
if Output.op in s.outputs:
Out = Output
Out_L = s.cache_write(Out, "local")
else:
Out = outs[0].op.output(0)
s[Output].set_scope("local")
Out_L = Output
if util.get_const_int(Filter.shape[3]) == 7 or (util.get_const_int(Output.shape[2]) == 224 and flag < 128):
conv2d_224_3_64(s, temp, temp_R, temp_S, Filter_S, Out, Out_L, flag)
elif 128 < flag < 512:
conv2d_56_64_128(s, temp, temp_R, temp_S, Filter_S, Out, Out_L, flag)
elif flag >= 512:
conv2d_14_256_256(s, temp, temp_R, temp_S, Filter, Filter_S, Out, Out_L)
else:
conv2d_56_64_64(s, Filter, temp_S, Filter_S, Out, Out_L)
scheduled_ops = []
def traverse(OP):
"""Internal travserse function"""
# inline all one-to-one-mapping operators except the last stage (output)
if tag.is_injective(OP.tag):
if OP not in s.outputs:
s[OP].compute_inline()
for tensor in OP.input_tensors:
if tensor.op.input_tensors and tensor.op not in scheduled_ops:
traverse(tensor.op)
# schedule conv2d_transpose_nchw
if 'conv2d_transpose_nchw' in OP.tag:
temp = OP.input_tensors[0]
DilatedInput = temp.op.input_tensors[0]
s[DilatedInput].compute_inline()
Filter = OP.input_tensors[1]
Output = OP.output(0)
schedule(temp, Filter, Output)
scheduled_ops.append(OP)
traverse(outs[0].op)
return s
@autotvm.task.register_topi_compute(nn.conv2d_transpose_nchw, ['cuda', 'gpu'], "direct")
def conv2d_transpose_nchw_cuda(cfg, Input, Filter, strides, padding, out_dtype):
"""Transposed 2D convolution nchw forward operator.
Parameters
----------
cfg: ConfigEntity
The config for this template
Input : tvm.Tensor
4-D with shape [batch, in_channel, in_height, in_width]
Filter : tvm.Tensor
4-D with shape [in_channel, num_filter, filter_height, filter_width]
strides : tuple of two ints
The spatial stride along height and width
padding : int or str
Padding size, or ['VALID', 'SAME']
out_dtype: str
The output type. This is used in mixed precision
@generic.schedule_conv2d_transpose_nchw.register(["cuda", "gpu"])
def schedule_conv2d_transpose_nchw(outs):
"""Schedule for conv2d_transpose_nchw.
Returns
-------
Output : tvm.Tensor
4-D with shape [batch, out_channel, out_height, out_width]
"""
batch, in_c, in_h, in_w = get_const_tuple(Input.shape)
_, out_c, filter_h, filter_w = get_const_tuple(Filter.shape)
stride_h, stride_w = strides
# attach stride info to config, this is used in schedule space definition
cfg.stride = strides
# padding stage
fpad_top, fpad_left, fpad_bottom, fpad_right = nn.get_pad_tuple(padding, (filter_h, filter_w))
bpad_top = filter_h - 1 - fpad_top
bpad_bottom = filter_h - 1 - fpad_bottom
bpad_left = filter_w - 1 - fpad_left
bpad_right = filter_w - 1 - fpad_right
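# The transposed convolution is lowered to an ordinary convolution over a
# zero-dilated, re-padded input with a flipped kernel. The equivalent padding
# on each side is (filter - 1) minus the forward padding; e.g. filter_h = 4
# and fpad_top = 1 give bpad_top = 4 - 1 - 1 = 2.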
# padding stage
FirstPad = nn.pad(Input,
[0, 0, (bpad_top + stride_h - 1) // stride_h,
(bpad_left + stride_w - 1) // stride_w],
[0, 0, (bpad_bottom + stride_h - 1) // stride_h,
(bpad_right + stride_w - 1) // stride_w], name='FirstPad')
# remove extra padding introduced by dilation
border_h = (stride_h - bpad_top % stride_h) % stride_h
border_w = (stride_w - bpad_left % stride_w) % stride_w
# dilation stage
data = FirstPad
strides = [1, 1, stride_h, stride_w]
n = len(data.shape)
def _dilate(*indices):
not_zero = []
index_tuple = []
for i in range(n):
if not equal_const_int(strides[i], 1):
index_tuple.append(indices[i] // strides[i])
not_zero.append((indices[i] % strides[i]).equal(0))
else:
index_tuple.append(indices[i])
if not_zero:
not_zero = tvm.all(*not_zero)
return tvm.select(not_zero, data(*index_tuple), tvm.const(0.0, data.dtype))
return data(*index_tuple)
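# _dilate reads the input only at indices that are exact multiples of the
# stride and yields zero elsewhere, which has the same effect as materializing
# an input with (stride - 1) zeros inserted between elements, but without
# allocating that intermediate tensor.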
# convolution stage
out_h = (in_h - 1) * stride_h - fpad_top - fpad_bottom + filter_h
out_w = (in_w - 1) * stride_w - fpad_left - fpad_right + filter_w
dc = tvm.reduce_axis((0, in_c), name='dc')
dh = tvm.reduce_axis((0, filter_h), name='dh')
dw = tvm.reduce_axis((0, filter_w), name='dw')
Output = tvm.compute(
(batch, out_c, out_h, out_w),
lambda b, c, h, w: tvm.sum(
_dilate(b, dc, h + dh + border_h, w + dw + border_w).astype(out_dtype) *
Filter[dc, c, filter_h - 1 - dh, filter_w - 1 - dw].astype(out_dtype),
axis=[dc, dh, dw]), tag="conv2d_transpose_nchw")
return Output
@autotvm.task.register_topi_schedule(generic.schedule_conv2d_transpose_nchw,
['cuda', 'gpu'], 'direct')
def schedule_conv2d_transpose_nchw_cuda(cfg, outs):
"""TOPI Schedule callback for conv2d transpose operator.
Parameters
----------
cfg: ConfigEntity
The parameters for this template
outs: Array of Tensor
The computation graph description of conv2d_transpose_nchw
The computation graph description of conv2d transpose
in the format of an array of tensors.
Returns
-------
s: Schedule
The computation schedule for conv2d_transpose_nchw.
The computation schedule for conv2d transpose.
"""
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
batch_size = util.get_const_int(outs[0].op.output(0).shape[0])
if batch_size > 1:
raise RuntimeError("Batch size: %d is too large for this schedule" % batch_size)
return schedule_conv2d_transpose_small_batch(outs)
s = tvm.create_schedule([x.op for x in outs])
def _callback(op):
if op.tag == 'conv2d_transpose_nchw':
pad_data = op.input_tensors[0]
kernel = op.input_tensors[1]
conv = op.output(0)
##### space definition begin #####
n, f, y, x = s[conv].op.axis
rc = s[conv].op.reduce_axis[0]
cfg.define_split("tile_f", cfg.axis(f), num_outputs=4)
cfg.define_split("tile_y", cfg.axis(y), num_outputs=4)
cfg.define_split("tile_x", cfg.axis(x), num_outputs=4)
cfg.define_split("tile_rc", cfg.axis(rc), num_outputs=3)
cfg.define_knob("auto_unroll_max_step", [64, 512, 1500])
target = tvm.target.current_target()
if target.target_name in ['nvptx', 'rocm']:
cfg.define_knob("unroll_explicit", [1])
else:
cfg.define_knob("unroll_explicit", [0, 1])
##### space definition end #####
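# Compared with the direct conv2d template above, tile_rc here has three
# outputs (rco, rcm, rci below) and the shared-memory stages are attached at
# the middle level rcm; the padding stage itself is also placed in shared
# memory instead of going through a separate cache_read.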
if isinstance(kernel.op, tvm.tensor.ComputeOp) and 'dilate' in kernel.op.tag:
s[kernel].compute_inline()
if conv.op in s.outputs:
output = conv
OL = s.cache_write(conv, 'local')
else:
output = s.outputs[0].output(0)
s[conv].set_scope('local')
OL = conv
# create cache stage
s[pad_data].set_scope('shared')
AA = pad_data
WW = s.cache_read(kernel, 'shared', [OL])
# tile and bind spatial axes
n, f, y, x = s[output].op.axis
kernel_scope, n = s[output].split(n, nparts=1)
bf, vf, tf, fi = cfg["tile_f"].apply(s, output, f)
by, vy, ty, yi = cfg["tile_y"].apply(s, output, y)
bx, vx, tx, xi = cfg["tile_x"].apply(s, output, x)
bf = s[output].fuse(n, bf)
s[output].bind(bf, tvm.thread_axis("blockIdx.z"))
s[output].bind(by, tvm.thread_axis("blockIdx.y"))
s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
s[output].bind(vf, tvm.thread_axis("vthread"))
s[output].bind(vy, tvm.thread_axis("vthread"))
s[output].bind(vx, tvm.thread_axis("vthread"))
s[output].bind(tf, tvm.thread_axis("threadIdx.z"))
s[output].bind(ty, tvm.thread_axis("threadIdx.y"))
s[output].bind(tx, tvm.thread_axis("threadIdx.x"))
s[output].reorder(bf, by, bx, vf, vy, vx, tf, ty, tx, fi, yi, xi)
s[OL].compute_at(s[output], tx)
# tile reduction axes
n, f, y, x = s[OL].op.axis
rc, ry, rx = s[OL].op.reduce_axis
rco, rcm, rci = cfg['tile_rc'].apply(s, OL, rc)
s[OL].reorder(rco, rcm, ry, rx, rci, n, f, y, x)
s[AA].compute_at(s[OL], rcm)
s[WW].compute_at(s[OL], rcm)
# cooperative fetching
for load in [AA, WW]:
n, f, y, x = s[load].op.axis
fused = s[load].fuse(n, f, y, x)
tz, fused = s[load].split(fused, nparts=cfg["tile_f"].size[2])
ty, fused = s[load].split(fused, nparts=cfg["tile_y"].size[2])
tx, fused = s[load].split(fused, nparts=cfg["tile_x"].size[2])
s[load].bind(tz, tvm.thread_axis("threadIdx.z"))
s[load].bind(ty, tvm.thread_axis("threadIdx.y"))
s[load].bind(tx, tvm.thread_axis("threadIdx.x"))
s[output].pragma(kernel_scope, 'auto_unroll_max_step', cfg['auto_unroll_max_step'].val)
s[output].pragma(kernel_scope, 'unroll_explicit', cfg['unroll_explicit'].val)
traverse_inline(s, outs[0].op, _callback)
return s
# pylint: disable=invalid-name,unused-variable,unused-argument
"""Winograd template for cuda backend"""
import numpy as np
import tvm
from tvm import autotvm
from .. import nn
from ..nn import conv2d_winograd_without_weight_transform
from ..util import get_const_int, get_const_tuple, const_matrix, traverse_inline
from ..generic import schedule_conv2d_winograd_without_weight_transform
def _winograd_conv_arg_to_workload(data, kernel, strides, padding, layout, out_dtype):
"""convert argument to workload"""
K = 3
shape = get_const_tuple(kernel.shape)
if shape[-2:] == (K, K):
raw_kernel = kernel
else: # pre-transformed
_, _, CI, CO = shape
raw_kernel = tvm.placeholder((CO, CI, K, K), dtype=kernel.dtype)
return ('conv2d', ) + autotvm.task.args_to_workload(
[data, raw_kernel, strides, padding, layout, out_dtype])
def _infer_tile_size(data, kernel):
N, CI, H, W = get_const_tuple(data.shape)
if H % 8 == 0:
return 4
return 2
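# Tile-size heuristic: use the larger F(4x4, 3x3) Winograd variant when the
# input height is divisible by 8, otherwise fall back to F(2x2, 3x3).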
def winograd_cuda(cfg, data, kernel, strides, padding, layout, out_dtype, pre_computed):
"""Compute declaration for winograd"""
assert layout == 'NCHW'
tile_size = _infer_tile_size(data, kernel)
N, CI, H, W = get_const_tuple(data.shape)
if not pre_computed: # kernel tensor is raw tensor, do strict check
CO, CI, KH, KW = get_const_tuple(kernel.shape)
HPAD, WPAD, _, _ = nn.get_pad_tuple(padding, kernel)
HSTR, WSTR = (strides, strides) if isinstance(strides, int) else strides
assert HSTR == 1 and WSTR == 1 and HPAD == 1 and WPAD == 1 and KH == 3 and KW == 3
else: # kernel tensor is pre-transformed. this op is created by
# alter op layout, do not check
HSTR = WSTR = 1
HPAD = WPAD = 1
KH = KW = 3
_, _, CI, CO = get_const_tuple(kernel.shape)
data_pad = nn.pad(data, (0, 0, HPAD, WPAD), (0, 0, HPAD, WPAD), name="data_pad")
if tile_size == 4:
G_data = np.array([
[1 / 4.0, 0, 0],
[-1 / 6.0, -1 / 6.0, -1 / 6.0],
[-1 / 6.0, 1 / 6.0, -1 / 6.0],
[1 / 24.0, 1 / 12.0, 1 / 6.0],
[1 / 24.0, -1 / 12.0, 1 / 6.0],
[0, 0, 1]], dtype=np.float32)
B_data = np.array([
[4, 0, 0, 0, 0, 0],
[0, -4, 4, -2, 2, 4],
[-5, -4, -4, -1, -1, 0],
[0, 1, -1, 2, -2, -5],
[1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 1]], out_dtype)
A_data = np.array([
[1, 0, 0, 0],
[1, 1, 1, 1],
[1, -1, 1, -1],
[1, 2, 4, 8],
[1, -2, 4, -8],
[0, 0, 0, 1]], out_dtype)
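# G, B and A are the standard Winograd transform matrices for F(4x4, 3x3):
# G (6x3) transforms the kernel, B (6x6) transforms each input tile, and
# A (6x4) maps the 6x6 element-wise product back to a 4x4 output tile.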
elif tile_size == 2:
G_data = np.array([
[1, 0, 0],
[1.0/2, 1.0/2, 1.0/2],
[1.0/2, -1.0/2, 1.0/2],
[0, 0, 1]], np.float32)
B_data = np.array([
[1, 0, 0, 0],
[0, 1, -1, 1],
[-1, 1, 1, 0],
[0, 0, 0, -1]], out_dtype)
A_data = np.array([
[1, 0],
[1, 1],
[1, -1],
[0, -1]], out_dtype)
else:
raise ValueError("Unsupported tile size for winograd: " + str(tile_size))
m = A_data.shape[1]
r = 3
alpha = m + r - 1
H = (H + 2 * HPAD - KH) // HSTR + 1
W = (W + 2 * WPAD - KW) // WSTR + 1
nH, nW = (H + m-1) // m, (W + m-1) // m
P = N * nH * nW
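# With output tile size m and kernel size r = 3, alpha = m + r - 1 is the
# transformed tile size and P is the total number of tiles. For example, a
# 1x64x56x56 input with m = 4 gives alpha = 6, nH = nW = 14 and P = 196.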
# transform kernel
if not pre_computed:
G = const_matrix(G_data, 'G')
r_kh = tvm.reduce_axis((0, KH), name='r_kh')
r_kw = tvm.reduce_axis((0, KW), name='r_kw')
kernel_pack = tvm.compute((alpha, alpha, CI, CO), lambda eps, nu, ci, co:
tvm.sum(kernel[co][ci][r_kh][r_kw] *
G[eps][r_kh] * G[nu][r_kw],
axis=[r_kh, r_kw]), name='kernel_pack')
else:
kernel_pack = kernel
# pack input tile
input_tile = tvm.compute((CI, P, alpha, alpha), lambda c, p, eps, nu:
data_pad[p // (nH * nW)][c][p // nW % nH * m + eps]
[p % nW * m + nu], name='d')
# transform data
B = const_matrix(B_data)
r_a = tvm.reduce_axis((0, alpha), 'r_a')
r_b = tvm.reduce_axis((0, alpha), 'r_b')
data_pack = tvm.compute((alpha, alpha, CI, P), lambda eps, nu, ci, p:
tvm.sum(input_tile[ci][p][r_a][r_b] * B[r_a][eps] * B[r_b][nu],
axis=[r_a, r_b]), name='data_pack')
# do batch gemm
ci = tvm.reduce_axis((0, CI), name='ci')
bgemm = tvm.compute((alpha, alpha, CO, P), lambda eps, nu, co, p:
tvm.sum(kernel_pack[eps][nu][ci][co] *
data_pack[eps][nu][ci][p],
axis=[ci]), name='bgemm')
# inverse transform
A = const_matrix(A_data)
r_a = tvm.reduce_axis((0, alpha), 'r_a')
r_b = tvm.reduce_axis((0, alpha), 'r_b')
inverse = tvm.compute((CO, P, m, m), lambda co, p, vh, vw:
tvm.sum(bgemm[r_a][r_b][co][p] * A[r_a][vh] * A[r_b][vw],
axis=[r_a, r_b]), name='inverse')
# output
output = tvm.compute((N, CO, H, W), lambda n, co, h, w:
inverse[co][n * nH * nW + (h // m) * nW + w // m][h % m][w % m],
name='output', tag='conv2d_nchw_winograd',
attrs={"workload": _winograd_conv_arg_to_workload(
data, kernel, strides, padding, layout, out_dtype)})
cfg.add_flop(2 * N * CO * H * W * CI * KH * KW)
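# The FLOP count recorded here is that of the mathematically equivalent direct
# convolution, so the GFLOPS numbers reported while tuning this template stay
# comparable with the direct template.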
return output
def schedule_winograd_cuda(cfg, s, output, pre_computed):
"""Schedule winograd template"""
# get stages
inverse = s[output].op.input_tensors[0]
bgemm, A = s[inverse].op.input_tensors
kernel_pack, data_pack = s[bgemm].op.input_tensors
input_tile, B = s[data_pack].op.input_tensors
pad_data = s[input_tile].op.input_tensors[0]
# data transform
s[B].compute_inline()
data_l = s.cache_write(data_pack, 'local')
eps, nu, c, p = s[data_l].op.axis
r_a, r_b = s[data_l].op.reduce_axis
for axis in [eps, nu, r_a, r_b]:
s[data_l].unroll(axis)
eps, nu, c, p = s[data_pack].op.axis
p, pi = s[data_pack].split(p, 1)
fused = s[data_pack].fuse(c, p)
bb, tt = s[data_pack].split(fused, 128)
s[data_pack].reorder(bb, tt, pi, eps, nu)
s[data_pack].bind(bb, tvm.thread_axis("blockIdx.x"))
s[data_pack].bind(tt, tvm.thread_axis("threadIdx.x"))
s[data_l].compute_at(s[data_pack], pi)
s[input_tile].compute_at(s[data_pack], pi)
s[pad_data].compute_inline()
# transform kernel
if not pre_computed:
kernel, G = s[kernel_pack].op.input_tensors
eps, nu, ci, co = s[kernel_pack].op.axis
if autotvm.GLOBAL_SCOPE.in_tuning:
# skip this part during tuning to make records accurate
# this part will be pre-computed during NNVM's pre-compute optimization pass
s[G].pragma(s[G].op.axis[0], 'debug_skip_region')
s[kernel_pack].pragma(eps, 'debug_skip_region')
else:
s[G].compute_inline()
r_a, r_b = s[kernel_pack].op.reduce_axis
for axis in [eps, nu, r_a, r_b]:
s[kernel_pack].unroll(axis)
fused = s[kernel_pack].fuse(ci, co)
bb, tt = s[kernel_pack].split(fused, 128)
s[kernel_pack].reorder(bb, tt, eps, nu, r_a, r_b)
s[kernel_pack].bind(bb, tvm.thread_axis("blockIdx.x"))
s[kernel_pack].bind(tt, tvm.thread_axis("threadIdx.x"))
else:
kernel = kernel_pack
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
s[kernel].compute_inline()
##### space definition begin #####
b1, b2, y, x = s[bgemm].op.axis
rc = s[bgemm].op.reduce_axis[0]
alpha = get_const_int(b1.dom.extent)
cfg.define_split("tile_b", cfg.axis(alpha * alpha), num_outputs=4,
filter=lambda x: x.size[-3:] == [1, 1, 1])
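# The filter pins the last three factors of tile_b to 1, i.e. the fused
# alpha*alpha dimension is only split at the block level (it is bound to
# blockIdx.z below) and never across vthreads or threads.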
cfg.define_split("tile_y", y, num_outputs=4)
cfg.define_split("tile_x", x, num_outputs=4)
cfg.define_split("tile_rc", rc, num_outputs=2)
cfg.define_knob("auto_unroll_max_step", [0, 128, 1500])
target = tvm.target.current_target()
if target.target_name in ['nvptx', 'rocm']:
cfg.define_knob("unroll_explicit", [1])
else:
cfg.define_knob("unroll_explicit", [0, 1])
##### space definition end #####
# batch gemm
C = bgemm
A0, B0 = kernel_pack, data_pack
OL = s.cache_write(C, 'local')
AA = s.cache_read(A0, 'shared', [OL])
BB = s.cache_read(B0, 'shared', [OL])
b = s[bgemm].fuse(b1, b2)
# tile and bind spatial axes
bgemm_scope, b = s[bgemm].split(b, nparts=1)
bz, vz, tz, zi = cfg["tile_b"].apply(s, C, b)
by, vy, ty, yi = cfg["tile_y"].apply(s, C, y)
bx, vx, tx, xi = cfg["tile_x"].apply(s, C, x)
s[C].bind(bz, tvm.thread_axis("blockIdx.z"))
s[C].bind(by, tvm.thread_axis("blockIdx.y"))
s[C].bind(bx, tvm.thread_axis("blockIdx.x"))
s[C].bind(vz, tvm.thread_axis("vthread"))
s[C].bind(vy, tvm.thread_axis("vthread"))
s[C].bind(vx, tvm.thread_axis("vthread"))
s[C].bind(tz, tvm.thread_axis("threadIdx.z"))
s[C].bind(ty, tvm.thread_axis("threadIdx.y"))
s[C].bind(tx, tvm.thread_axis("threadIdx.x"))
s[C].reorder(bgemm_scope, bz, by, bx, vz, vy, vx, tz, ty, tx, zi, yi, xi)
# tile reduction axes
s[OL].compute_at(s[C], tx)
b1, b2, y, x = s[OL].op.axis
b = s[OL].fuse(b1, b2)
rc, = s[OL].op.reduce_axis
rco, rci = cfg['tile_rc'].apply(s, OL, rc)
s[OL].reorder(rco, rci, b, y, x)
s[AA].compute_at(s[OL], rco)
s[BB].compute_at(s[OL], rco)
# cooperative fetching
for load in [AA, BB]:
fused = s[load].fuse(*list(s[load].op.axis))
fused, tx = s[load].split(fused, cfg["tile_x"].size[2])
fused, ty = s[load].split(fused, cfg["tile_y"].size[2])
fused, tz = s[load].split(fused, cfg["tile_b"].size[2])
s[load].bind(tz, tvm.thread_axis("threadIdx.z"))
s[load].bind(ty, tvm.thread_axis("threadIdx.y"))
s[load].bind(tx, tvm.thread_axis("threadIdx.x"))
s[C].pragma(bgemm_scope, 'auto_unroll_max_step', cfg['auto_unroll_max_step'].val)
s[C].pragma(bgemm_scope, 'unroll_explicit', cfg['unroll_explicit'].val)
# schedule inverse, output and fusion
if output.op in s.outputs:
OL = None
else:
OL = output
s[OL].set_scope('local')
output = s.outputs[0]
m = alpha - 3 + 1
n, co, h, w = s[output].op.axis
ho, wo, hi, wi = s[output].tile(h, w, m, m)
inverse_scope, n = s[output].split(n, nparts=1)
fused = s[output].fuse(n, co, ho, wo)
bb, tt = s[output].split(fused, 128)
s[output].bind(bb, tvm.thread_axis("blockIdx.x"))
s[output].bind(tt, tvm.thread_axis("threadIdx.x"))
if OL is not None:
s[OL].compute_at(s[output], tt)
s[A].compute_inline()
co, p, vh, vw = s[inverse].op.axis
r_a, r_b = s[inverse].op.reduce_axis
for axis in [vh, vw, r_a, r_b]:
s[inverse].unroll(axis)
s[inverse].compute_at(s[output], tt)
return s
##### REGISTER TOPI COMPUTE / SCHEDULE FOR WINOGRAD WITHOUT WEIGHT TRANSFORM #####
@conv2d_winograd_without_weight_transform.register(['cuda', 'gpu'])
@autotvm.task.dispatcher
def winograd_ww_config_dispatcher_cuda(data, kernel, strides, padding, layout, out_dtype,
tile_size):
return _winograd_conv_arg_to_workload(data, kernel, strides, padding, layout, out_dtype)
@winograd_ww_config_dispatcher_cuda.register(['winograd'])
def decl_winograd_ww(cfg, data, kernel, strides, padding, layout, out_dtype, tile_size):
return winograd_cuda(cfg, data, kernel, strides, padding, layout, out_dtype, pre_computed=True)
@autotvm.register_topi_schedule(schedule_conv2d_winograd_without_weight_transform,
['cuda', 'gpu'], ['winograd'])
def schedule_conv2d_winograd_without_weight_transform_cuda(cfg, outs):
"""TOPI schedule callback"""
s = tvm.create_schedule([x.op for x in outs])
def _callback(op):
if 'conv2d_nchw_winograd' in op.tag:
schedule_winograd_cuda(cfg, s, op.output(0), pre_computed=True)
traverse_inline(s, outs[0].op, _callback)
return s
##### REGISTER ALTER OP LAYOUT #####
@nn.conv2d_alter_layout.register(["cuda", "gpu"])
def _alter_conv2d_layout(attrs, inputs, tinfos):
"""Alter op layout for pre-computing kernel transformation"""
if 'cudnn' in tvm.target.current_target().libs or 'miopen' in tvm.target.current_target().libs:
return None
import nnvm.symbol as sym
copy_inputs = [s for s in inputs]
new_attrs = {k: attrs[k] for k in attrs.keys()}
assert attrs.get_int_tuple("dilation") == (1, 1), "Does not support dilation " \
"when alter_op_layout is enabled"
strides = attrs.get_int_tuple("strides")
padding = attrs.get_int_tuple("padding")
groups = attrs.get_int('groups')
layout = attrs["layout"]
out_dtype = attrs["out_dtype"]
out_dtype = tinfos[0].dtype if out_dtype == "same" else out_dtype
if groups == 1:
# query config of this workload
workload = ('conv2d',) + autotvm.task.args_to_workload(
[tinfos[0], tinfos[1], strides, padding, layout, out_dtype])
cfg = autotvm.DispatchContext.current.query(tvm.target.current_target(), workload)
if cfg.is_fallback: # if is fallback, clear query cache and return None
autotvm.task.clear_fallback_cache(tvm.target.current_target(), workload)
return None
if cfg.template_key == 'direct':
return None
# pre-compute weight transformation in winograd
tile_size = _infer_tile_size(tinfos[0], tinfos[1])
weight = sym.contrib.conv2d_winograd_weight_transform(copy_inputs[1],
tile_size=tile_size)
weight = sym.transpose(weight, axes=[0, 1, 3, 2])
copy_inputs[1] = weight
new_attrs['tile_size'] = tile_size
return sym.contrib.conv2d_winograd_without_weight_transform(*copy_inputs, **new_attrs)
# do nothing for depthwise convolution
return None
# pylint: disable=invalid-name
"""Schedule for depthwise_conv2d with auto fusion"""
import tvm
from ..util import get_const_tuple
from tvm import autotvm
from ..util import traverse_inline
from .. import tag
from .. import generic
from .. import generic, nn
@generic.schedule_depthwise_conv2d_nchw.register(["cuda", "gpu"])
def schedule_depthwise_conv2d_nchw(outs):
# register original implementation of depthwise_conv2d_nchw since we don't need to change this part
autotvm.register_topi_compute(nn.depthwise_conv2d_nchw, ['cuda', 'gpu'], 'direct',
nn.depthwise_conv2d_nchw.fdefault)
@autotvm.register_topi_schedule(generic.schedule_depthwise_conv2d_nchw, ['cuda', 'gpu'], 'direct')
def schedule_depthwise_conv2d_nchw_cuda(cfg, outs):
"""Schedule for depthwise_conv2d nchw forward.
Parameters
......@@ -22,108 +27,92 @@ def schedule_depthwise_conv2d_nchw(outs):
"""
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
s = tvm.create_schedule([x.op for x in outs])
def _schedule(PaddedInput, Filter, DepthwiseConv2d):
in_shape = get_const_tuple(PaddedInput.shape)
out_shape = get_const_tuple(DepthwiseConv2d.shape)
in_height = in_shape[2]
in_width = in_shape[3]
out_height = out_shape[2]
out_width = out_shape[3]
channel_multiplier = get_const_tuple(Filter.shape)[1]
s[PaddedInput].compute_inline()
IS = s.cache_read(PaddedInput, "shared", [DepthwiseConv2d])
FS = s.cache_read(Filter, "shared", [DepthwiseConv2d])
IL = s.cache_read(IS, "local", [DepthwiseConv2d])
FL = s.cache_read(FS, "local", [DepthwiseConv2d])
if DepthwiseConv2d.op in s.outputs:
Output = DepthwiseConv2d
CL = s.cache_write(DepthwiseConv2d, "local")
else:
Output = outs[0].op.output(0)
s[DepthwiseConv2d].set_scope("local")
# schedule parameters
num_thread_y = 8
num_thread_x = 8
num_vthread_y = 1
num_vthread_x = 1
blocking_h = out_height
blocking_w = out_width
if out_height % 32 == 0 or in_height >= 108:
blocking_h = 32
if out_width % 32 == 0:
blocking_w = 32
num_thread_x = 16
num_vthread_x = 2
elif in_width >= 108:
blocking_w = 32
block_y = tvm.thread_axis("blockIdx.y")
block_x = tvm.thread_axis("blockIdx.x")
thread_y = tvm.thread_axis((0, num_thread_y), "threadIdx.y")
thread_x = tvm.thread_axis((0, num_thread_x), "threadIdx.x")
thread_vy = tvm.thread_axis((0, num_vthread_y), "vthread", name="vy")
thread_vx = tvm.thread_axis((0, num_vthread_x), "vthread", name="vx")
# split and bind
by, byi = s[Output].split(Output.op.axis[1], factor=channel_multiplier)
s[Output].reorder(Output.op.axis[2], Output.op.axis[3], byi)
by = s[Output].fuse(Output.op.axis[0], by)
s[Output].bind(by, block_y)
bx1, x1i = s[Output].split(Output.op.axis[2], factor=blocking_h)
tvy, vyi = s[Output].split(x1i, nparts=num_vthread_y)
ty, yi = s[Output].split(vyi, nparts=num_thread_y)
bx2, x2i = s[Output].split(Output.op.axis[3], factor=blocking_w)
tvx, vxi = s[Output].split(x2i, nparts=num_vthread_x)
tx, xi = s[Output].split(vxi, nparts=num_thread_x)
s[Output].reorder(bx1, bx2, tvy, tvx, ty, tx, yi, xi)
bx = s[Output].fuse(bx1, bx2)
s[Output].bind(bx, block_x)
s[Output].bind(tvy, thread_vy)
s[Output].bind(tvx, thread_vx)
s[Output].bind(ty, thread_y)
s[Output].bind(tx, thread_x)
# local memory load
s[IL].compute_at(s[Output], tx)
s[FL].compute_at(s[Output], tx)
if DepthwiseConv2d.op in s.outputs:
s[CL].compute_at(s[Output], tx)
else:
s[DepthwiseConv2d].compute_at(s[Output], tx)
# input's shared memory load
s[IS].compute_at(s[Output], bx)
ty, yi = s[IS].split(IS.op.axis[2], nparts=num_thread_y)
tx, xi = s[IS].split(IS.op.axis[3], nparts=num_thread_x)
s[IS].bind(ty, thread_y)
s[IS].bind(tx, thread_x)
# filter's shared memory load
s[FS].compute_at(s[Output], bx)
s[FS].reorder(FS.op.axis[2], FS.op.axis[3], FS.op.axis[1])
ty, yi = s[FS].split(FS.op.axis[2], nparts=num_thread_y)
tx, xi = s[FS].split(FS.op.axis[3], nparts=num_thread_x)
s[FS].bind(ty, thread_y)
s[FS].bind(tx, thread_x)
scheduled_ops = []
def traverse(OP):
"""Internal travserse function"""
# inline all one-to-one-mapping operators except the last stage (output)
if tag.is_broadcast(OP.tag):
if OP not in s.outputs:
s[OP].compute_inline()
for tensor in OP.input_tensors:
if tensor.op.input_tensors and tensor.op not in scheduled_ops:
traverse(tensor.op)
# schedule depthwise_conv2d
if OP.tag == 'depthwise_conv2d_nchw':
PaddedInput = OP.input_tensors[0]
Filter = OP.input_tensors[1]
if isinstance(Filter.op, tvm.tensor.ComputeOp) and 'dilate' in Filter.op.tag:
s[Filter].compute_inline()
DepthwiseConv2d = OP.output(0)
_schedule(PaddedInput, Filter, DepthwiseConv2d)
scheduled_ops.append(OP)
traverse(outs[0].op)
def _callback(op):
if op.tag == 'depthwise_conv2d_nchw':
pad_data = op.input_tensors[0]
kernel = op.input_tensors[1]
conv = op.output(0)
##### space definition begin #####
n, f, y, x = s[conv].op.axis
cfg.define_split("tile_f", f, num_outputs=4)
cfg.define_split("tile_y", y, num_outputs=4)
cfg.define_split("tile_x", x, num_outputs=4)
cfg.define_knob("auto_unroll_max_step", [0, 256, 1500])
target = tvm.target.current_target()
if target.target_name in ['nvptx', 'rocm']:
cfg.define_knob("unroll_explicit", [1])
else:
cfg.define_knob("unroll_explicit", [0, 1])
# fallback support
if cfg.is_fallback:
ref_log = autotvm.tophub.load_reference_log(
target.target_name, target.model, 'depthwise_conv2d_nchw', 'direct')
cfg.fallback_with_reference_log(ref_log)
# TODO(lmzheng): there is a bug here; set unroll_explicit to False as a workaround
cfg['unroll_explicit'].val = 0
##### space definition end #####
s[pad_data].compute_inline()
if isinstance(kernel.op, tvm.tensor.ComputeOp) and 'dilate' in kernel.op.tag:
s[kernel].compute_inline()
if conv.op in s.outputs:
output = conv
OL = s.cache_write(conv, 'local')
else:
output = s.outputs[0].output(0)
s[conv].set_scope('local')
OL = conv
# create cache stage
AA = s.cache_read(pad_data, 'shared', [OL])
WW = s.cache_read(kernel, 'shared', [OL])
AL = s.cache_read(AA, 'local', [OL])
WL = s.cache_read(WW, 'local', [OL])
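# Unlike the direct conv2d template, this depthwise template adds a second,
# register-level cache (AL, WL) on top of the shared-memory stages, presumably
# because each output channel reads only its own input channel, so most data
# reuse is local to a thread.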
# tile and bind spatial axes
n, f, y, x = s[output].op.axis
bf, vf, tf, fi = cfg["tile_f"].apply(s, output, f)
by, vy, ty, yi = cfg["tile_y"].apply(s, output, y)
bx, vx, tx, xi = cfg["tile_x"].apply(s, output, x)
kernel_scope, n = s[output].split(n, nparts=1)
bf = s[output].fuse(n, bf)
s[output].bind(bf, tvm.thread_axis("blockIdx.z"))
s[output].bind(by, tvm.thread_axis("blockIdx.y"))
s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
s[output].bind(vf, tvm.thread_axis("vthread"))
s[output].bind(vy, tvm.thread_axis("vthread"))
s[output].bind(vx, tvm.thread_axis("vthread"))
s[output].bind(tf, tvm.thread_axis("threadIdx.z"))
s[output].bind(ty, tvm.thread_axis("threadIdx.y"))
s[output].bind(tx, tvm.thread_axis("threadIdx.x"))
s[output].reorder(bf, by, bx, vf, vy, vx, tf, ty, tx, fi, yi, xi)
s[OL].compute_at(s[output], tx)
# cooperative fetching
s[AA].compute_at(s[output], bx)
s[WW].compute_at(s[output], bx)
s[AL].compute_at(s[output], tx)
s[WL].compute_at(s[output], tx)
for load in [AA, WW]:
fused = s[load].fuse(*list(s[load].op.axis))
fused, tx = s[load].split(fused, cfg["tile_x"].size[2])
fused, ty = s[load].split(fused, cfg["tile_y"].size[2])
fused, tz = s[load].split(fused, cfg["tile_f"].size[2])
s[load].bind(tz, tvm.thread_axis("threadIdx.z"))
s[load].bind(ty, tvm.thread_axis("threadIdx.y"))
s[load].bind(tx, tvm.thread_axis("threadIdx.x"))
s[output].pragma(kernel_scope, 'auto_unroll_max_step', cfg['auto_unroll_max_step'].val)
s[output].pragma(kernel_scope, 'unroll_explicit', cfg['unroll_explicit'].val)
traverse_inline(s, outs[0].op, _callback)
return s
@generic.schedule_depthwise_conv2d_nhwc.register(["cuda", "gpu"])
......@@ -143,8 +132,8 @@ def schedule_depthwise_conv2d_nhwc(outs):
"""
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
s = tvm.create_schedule([x.op for x in outs])
def _schedule(temp, Filter, DepthwiseConv2d):
def _schedule(temp, Filter, DepthwiseConv2d):
s[temp].compute_inline()
FS = s.cache_read(Filter, "shared", [DepthwiseConv2d])
if DepthwiseConv2d.op in s.outputs:
......
# pylint: disable=invalid-name,unused-variable,unused-argument,no-else-return
"""conv2d schedule on ARM Mali GPU"""
from __future__ import absolute_import as _abs
import numpy as np
import tvm
from .. import generic
from .. import util
from .. import tag
from ..nn import pad
from ..nn.conv2d import conv2d
from ..nn.util import get_pad_tuple
##### SCHEDULE UTILITIES #####
def fuse_and_bind(s, tensor, axis=None, num_thread=None):
""" fuse all the axis and bind to GPU threads """
axis = axis or s[tensor].op.axis
fused = s[tensor].fuse(*axis)
max_threads = tvm.target.current_target(allow_none=False).max_num_threads
bx, tx = s[tensor].split(fused, num_thread or max_threads)
s[tensor].bind(bx, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(tx, tvm.thread_axis("threadIdx.x"))
return bx, tx
def tile_and_bind(s, tensor, y, x, y_factor, x_factor=None):
""" tile and bind to GPU threads """
x_factor = x_factor or y_factor
yo, xo, yi, xi = s[tensor].tile(y, x, y_factor, x_factor)
s[tensor].bind(xo, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(xi, tvm.thread_axis("threadIdx.x"))
s[tensor].bind(yo, tvm.thread_axis("blockIdx.y"))
s[tensor].bind(yi, tvm.thread_axis("threadIdx.y"))
return yo, xo, yi, xi
def tile_and_bind3d(s, tensor, z, y, x, z_factor=2, y_factor=None, x_factor=None):
""" tile and bind 3d """
y_factor = y_factor or z_factor
x_factor = x_factor or y_factor
zo, zi = s[tensor].split(z, z_factor)
yo, yi = s[tensor].split(y, y_factor)
xo, xi = s[tensor].split(x, x_factor)
s[tensor].bind(zo, tvm.thread_axis("blockIdx.z"))
s[tensor].bind(zi, tvm.thread_axis("threadIdx.z"))
s[tensor].bind(yo, tvm.thread_axis("blockIdx.y"))
s[tensor].bind(yi, tvm.thread_axis("threadIdx.y"))
s[tensor].bind(xo, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(xi, tvm.thread_axis("threadIdx.x"))
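# Example: tile_and_bind3d(s, data_vec, h, w, ci, 1), as used further down,
# splits each of the three axes by the given factors (all 1 here) and binds
# the outer parts to blockIdx.z/y/x and the inner parts to threadIdx.z/y/x.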
def pack_tensor(s, tensor, factor, readers):
""" do transform X[n, m] -> X[n / factor, m, factor] """
tmp = s.cache_read(tensor, 'global', readers)
y, x = s[tmp].op.axis
yo, yi = s[tmp].split(y, factor)
s[tmp].reorder(yo, x, yi)
s[tmp].compute_inline()
return s.cache_write(tmp, 'global')
def transpose(s, tensor, readers):
""" do transform X[n, m] -> X[m, n] """
tmp = s.cache_read(tensor, 'global', readers)
y, x = s[tmp].op.axis
s[tmp].reorder(x, y)
s[tmp].compute_inline()
return s.cache_write(tmp, "global"), tmp
def const_array(data, name):
""" convert an const array to tvm tensor"""
row, col = data.shape
dtype = str(data.dtype)
def select_array(i, j):
now = tvm.const(0.0, dtype)
for ii in range(row):
for jj in range(col):
now = tvm.select(tvm.all(i % row == ii, j % col == jj),
tvm.const(data[ii][jj], dtype),
now)
return now
return tvm.compute(data.shape, select_array, name=name)
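# select_array expands into a chain of tvm.select expressions over every
# (row, col) entry, so the constant matrix values end up embedded directly in
# the generated code instead of being loaded from memory.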
@conv2d.register(["mali"])
def decl_conv2d(data, kernel, stride, padding, layout='NCHW', out_dtype='float32'):
"""Conv2D operator for ARM Mali GPU backend.
Parameters
----------
data : tvm.Tensor
4-D with shape [batch, in_channel, in_height, in_width]
import tvm
from tvm import autotvm
from tvm.autotvm.task.space import get_factors
kernel : tvm.Tensor
4-D with shape [num_filter, in_channel, filter_height, filter_width]
from ..generic import schedule_conv2d_nchw, schedule_conv2d_winograd_without_weight_transform
from ..util import traverse_inline, get_const_int, get_const_tuple, const_matrix
from ..nn import conv2d, conv2d_winograd_without_weight_transform, \
get_pad_tuple, pad
stride : int or a list/tuple of two ints
stride size, or [stride_height, stride_width]
# reuse some compute declarations from ARM CPU
from ..arm_cpu.conv2d import _conv_arg_to_workload, _decl_spatial_pack,\
_winograd_conv_arg_to_workload
padding : int or a list/tuple of two ints
padding size, or [pad_height, pad_width]
layout : str
layout of data
@conv2d.register('mali')
@autotvm.task.dispatcher
def conv2d_mali(data, kernel, strides, padding, layout, out_dtype):
"""TOPI compute callback. Mark this function as a dispatcher, so
this template can assign config according to workload
Returns
-------
output : tvm.Tensor
4-D with shape [batch, out_channel, out_height, out_width]
workload: Tuple
Dispatcher will use this workload to query corresponding config.
Then use cfg.template_key to call a registered template.
"""
assert layout == 'NCHW', "only support NCHW convolution on mali"
assert data.shape[0].value == 1, "only support batch size=1 convolution on mali"
assert data.dtype == kernel.dtype, "Do not support inputs with different data types now."
return _conv_arg_to_workload(data, kernel, strides, padding, layout, out_dtype)
out_dtype = data.dtype
HPAD, WPAD, _, _ = get_pad_tuple(padding, kernel)
kernel_shape = util.get_const_tuple(kernel.shape)
if isinstance(stride, (tuple, list)):
HSTR, WSTR = stride
else:
HSTR, WSTR = stride, stride
@conv2d_mali.register(['direct'])
def decl_spatial_pack(cfg, data, kernel, strides, padding, layout, out_dtype):
"""spatial packing template"""
return _decl_spatial_pack(cfg, data, kernel, strides, padding, layout, out_dtype, num_tile=3)
if (kernel_shape[2:4] == (3, 3) and (HPAD, WPAD) == (1, 1) and kernel_shape[0] >= 64 and
(HSTR, WSTR) == (1, 1)):
return _decl_winograd(data, kernel, stride, padding, layout, out_dtype)
elif kernel_shape[2:4] == (1, 1):
return _decl_im2col(data, kernel, stride, padding, layout, out_dtype)
else:
return _decl_spatialpack(data, kernel, stride, padding, layout, out_dtype)
@generic.schedule_conv2d_nchw.register(["mali"])
def schedule_conv2d_nchw(outs):
"""Schedule for conv2d_nchw for ARM Mali GPU
@autotvm.register_topi_schedule(schedule_conv2d_nchw, 'mali', ['direct', 'winograd'])
def schedule_conv2d_nchw_mali(cfg, outs):
"""TOPI schedule callback for conv2d
Parameters
----------
cfg: ConfigEntity
The configuration of this template
outs: Array of Tensor
The computation graph description of conv2d_nchw
The computation graph description of convolution2d
in the format of an array of tensors.
Returns
-------
s: Schedule
The computation schedule for conv2d_nchw.
The computation schedule for conv2d
"""
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
s = tvm.create_schedule([x.op for x in outs])
scheduled_ops = []
def traverse(op):
"""inline all one-to-one-mapping operators except the last stage (output)"""
if tag.is_broadcast(op.tag):
if op not in s.outputs:
s[op].compute_inline()
for tensor in op.input_tensors:
if tensor.op.input_tensors and tensor.op not in scheduled_ops:
traverse(tensor.op)
def _callback(op):
# schedule conv2d
if 'spatial_conv2d_output' in op.tag:
output = op.output(0)
conv = op.input_tensors[0]
if 'im2col_conv_output' in op.tag:
_schedule_im2col_conv2d(s, op)
data_vec = conv.op.input_tensors[0]
data_pad = data_vec.op.input_tensors[0]
s[data_pad].compute_inline()
if 'spatialpack_conv_output' in op.tag:
_schedule_spatialpack_conv2d(s, op)
kernel_vec = conv.op.input_tensors[1]
if kernel_vec.op.name == 'kernel_vec':
kernel = kernel_vec.op.input_tensors[0]
else:
kernel = kernel_vec
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
s[kernel].compute_inline()
if 'winograd_conv_output' in op.tag:
_schedule_winograd(s, op)
_schedule_spatial_pack(cfg, s, output, conv, data_vec, kernel_vec)
scheduled_ops.append(op)
if 'winograd_conv2d_output' in op.tag:
_schedule_winograd(cfg, s, op)
traverse(outs[0].op)
traverse_inline(s, outs[0].op, _callback)
return s
def _decl_spatialpack(data, kernel, stride, padding, layout, out_dtype):
"""declare the spatialpack method (spatial packing) for conv2d"""
_, CI, IH, IW = [util.get_const_int(x) for x in data.shape]
CO, _, KH, KW = [util.get_const_int(x) for x in kernel.shape]
HPAD, WPAD, _, _ = get_pad_tuple(padding, kernel)
HCAT, WCAT = KH - 1, KW - 1
if isinstance(stride, (tuple, list)):
HSTR, WSTR = stride
else:
HSTR, WSTR = stride, stride
N = 1
TH = IH + 2*HPAD
TW = IW + 2*WPAD
OH = (IH + 2*HPAD - KH) // HSTR + 1
OW = (IW + 2*WPAD - KW) // WSTR + 1
DO_PAD = (HPAD != 0 and WPAD != 0)
if DO_PAD:
data_pad = pad(data, (0, 0, HPAD, WPAD), name="data_pad")
else:
data_pad = data
# set tunable parameters (tile factor, ...)
tune_config = getattr(tvm.target.current_target(), "tune_config", None)
if tune_config is None:
VH = 1
VW, VC = 4, 4
# correct tile factor
if OW % VW != 0:
if OW == 14:
VW = 2
VC = 8
elif OW == 7:
VW = 7
else:
VH = tune_config['VH']
VW = tune_config['VW']
VC = tune_config['VC']
if data.dtype == 'float16':
VC *= 2
assert CO % VC == 0
assert OH % VH == 0, "OH: %d VH : %d" % (OH, VH)
assert OW % VW == 0, "OW: %d VW : %d" % (OW, VW)
dvshape = (N, TH//(VH*HSTR), TW//(VW*WSTR), CI, VH*HSTR+HCAT, VW*WSTR+WCAT)
kvshape = (CO // VC, CI, KH, KW, VC)
ovshape = (N, CO // VC, OH // VH, OW // VW, VH, VW, VC)
oshape = (N, CO, OH, OW)
data_vec = tvm.compute(dvshape, lambda n, h, w, ci, vh, vw:
data_pad[n][ci][h*VH*HSTR+vh][w*VW*WSTR+vw],
name='data_vec')
kernel_vec = tvm.compute(kvshape, lambda co, ci, kh, kw, vc:
kernel[co*VC+vc][ci][kh][kw],
name='kernel_vec')
ci = tvm.reduce_axis((0, CI), name='ci')
kh = tvm.reduce_axis((0, KH), name='kh')
kw = tvm.reduce_axis((0, KW), name='kw')
conv = tvm.compute(ovshape, lambda n, co, h, w, vh, vw, vc:\
tvm.sum(data_vec[n, h, w, ci, vh*HSTR+kh, vw*WSTR+kw].astype(out_dtype) *
kernel_vec[co, ci, kh, kw, vc].astype(out_dtype),
axis=[ci, kh, kw]), name='conv')
output = tvm.compute(oshape, lambda n, co, h, w:
conv[n][co//VC][h//VH][w//VW][h%VH][w%VW][co%VC],
name='output_unpack', tag='spatialpack_conv_output')
return output
def _schedule_spatialpack_conv2d(s, op):
"""schedule the spatialpack method (spatial packing) for conv2d"""
# get ops and tensors
output = op.output(0)
output_height = util.get_const_int(output.shape[2])
conv = op.input_tensors[0]
data_vec = s[conv].op.input_tensors[0]
kernel_vec = s[conv].op.input_tensors[1]
def _schedule_spatial_pack(cfg, s, output, conv, data_vec, kernel_vec):
"""schedule the spatial packing for conv2d"""
data = s[data_vec].op.input_tensors[0]
kernel = s[kernel_vec].op.input_tensors[0]
# set tunable parameters (tile factor, ...)
tune_config = getattr(tvm.target.current_target(), "tune_config", None)
if tune_config is None:
num_thread = 8
out_channel = util.get_const_int(kernel.shape[0])
in_channel = util.get_const_int(kernel.shape[1])
in_width = util.get_const_int(data.shape[2])
if in_width >= 224:
pass
elif in_width >= 112:
pass
elif in_width >= 56:
if out_channel != in_channel:
num_thread = 16
elif in_width >= 28:
if out_channel >= 256:
num_thread = 16
elif in_width >= 14:
if in_channel == out_channel:
num_thread = 8
else:
num_thread = 4
else:
num_thread = tune_config["num_thread"]
last = 1
if output_height == 28:
last = 7
num_thread = 32
if data.dtype == 'float16' and (util.get_const_int(conv.shape[1]) == 4 or output_height == 28):
num_thread //= 2
# schedule dilation
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
s[kernel].compute_inline()
max_unroll = 16
vec_size = [1, 2, 4, 8, 16]
# get tunable parameters (they are defined in compute)
BC, TC, VC = cfg["tile_co"].size
BH, TH, VH = cfg["tile_oh"].size
BW, TW, VW = cfg["tile_ow"].size
# schedule padding
if isinstance(data.op, tvm.tensor.ComputeOp) and "pad" in data.op.tag:
data_pad = data
data = data_pad.op.input_tensors[0]
s[data_pad].compute_inline()
# schedule data packing
_, h, w, ci, vh, vw = s[data_vec].op.axis
tile_and_bind3d(s, data_vec, h, w, ci, 1)
s[data_vec].unroll(vw)
# schedule kernel packing
co, ci, kh, kw, vc = s[kernel_vec].op.axis
tile_and_bind(s, kernel_vec, co, ci, 1)
s[kernel_vec].unroll(kh)
s[kernel_vec].unroll(kw)
s[kernel_vec].vectorize(vc)
if vh.dom.extent.value < max_unroll:
s[data_vec].unroll(vh)
if vw.dom.extent.value < max_unroll:
s[data_vec].unroll(vw)
if isinstance(kernel_vec.op, tvm.tensor.ComputeOp) and kernel_vec.name == 'kernel_vec':
if autotvm.GLOBAL_SCOPE.in_tuning:
# kernel packing will be pre-computed during compilation, so we skip
# this part to make tuning records correct
s[kernel_vec].pragma(s[kernel_vec].op.axis[0], 'debug_skip_region')
else:
max_threads = tvm.target.current_target(allow_none=False).max_num_threads
co, ci, kh, kw, vc = s[kernel_vec].op.axis
fused = s[kernel_vec].fuse(co, ci, kh, kw, vc)
fused, vec = s[kernel_vec].split(fused, VC)
bb, tt = s[kernel_vec].split(fused, max_threads)
s[kernel_vec].bind(bb, tvm.thread_axis("blockIdx.x"))
s[kernel_vec].bind(tt, tvm.thread_axis("threadIdx.x"))
if VC in vec_size:
s[kernel_vec].vectorize(vec)
# schedule convolution
_, c, h, w, vh, vw, vc = s[conv].op.axis
n, c, h, w, vh, vw, vc = s[conv].op.axis
kc, kh, kw = s[conv].op.reduce_axis
s[conv].reorder(_, c, h, w, vh, kc, kh, kw, vw, vc)
tile_and_bind3d(s, conv, c, h, w, num_thread, 1, last)
s[conv].unroll(kh)
s[conv].unroll(kw)
s[conv].unroll(vw)
s[conv].vectorize(vc)
cfg["reorder_0"].apply(s, conv, [n, c, h, w, kc, kh, kw, vh, vw, vc])
tile_and_bind3d(s, conv, c, h, w, TC, TH, TW)
cfg["ann_reduce"].apply(s, conv, [kh, kw],
axis_lens=[get_const_int(kernel_vec.shape[2]),
get_const_int(kernel_vec.shape[3])],
max_unroll=max_unroll)
cfg["ann_spatial"].apply(s, conv, [vh, vw, vc],
axis_lens=[VH, VW, VC],
max_unroll=max_unroll,
vec_size=vec_size,
cfg=cfg)
# schedule output
if output.op not in s.outputs: # has bias
......@@ -330,364 +145,324 @@ def _schedule_spatialpack_conv2d(s, op):
output = s.outputs[0]
_, co, oh, ow = s[output].op.axis
tile_and_bind3d(s, output, co, oh, ow, num_thread, 1, last)
def _decl_im2col(data, kernel, stride, padding, layout='NCHW', out_dtype='float32'):
"""declare the Im2Col method for conv2d"""
_, CI, IH, IW = [x.value for x in data.shape]
CO, _, KH, KW = [x.value for x in kernel.shape]
HPAD, WPAD, _, _ = get_pad_tuple(padding, kernel)
if isinstance(stride, (tuple, list)):
HSTR, WSTR = stride
else:
HSTR, WSTR = stride, stride
N = 1
OH = (IH + 2*HPAD - KH) // HSTR + 1
OW = (IW + 2*WPAD - KW) // WSTR + 1
DO_PAD = (HPAD != 0 and WPAD != 0)
if DO_PAD:
data_pad = pad(data, (0, 0, HPAD, WPAD), name="data_pad")
else:
data_pad = data
tile_and_bind3d(s, output, co, oh, ow, TC, TH, TW)
ALIGN = 16
def upround(x, align):
return (x + align - 1) // align * align
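# e.g. upround(27, 16) == 32, so reduce_len and the GEMM extents below are always
# padded up to a multiple of ALIGN.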
# A [CO, CI * KH * KW]
reduce_len = upround(CI * KH * KW, ALIGN)
A = tvm.compute((upround(CO, ALIGN), reduce_len), lambda i, j:
kernel[i][j // KW // KH][j // KW % KH][j % KW], name='A')
# B [CI * KH * KW, N * OH * OW]
B = tvm.compute((reduce_len, upround(N * OH * OW, ALIGN)), lambda i, j:\
tvm.select(tvm.all(i < CI * KH * KW, j < N * OH * OW),
data_pad[j // (OH*OW)][i // (KH*KW)][j // OW % OH*HSTR + i // KW % KH]
[j % OW*WSTR + i % KW],
tvm.const(0, data_pad.dtype)), name='B')
gemm_n, gemm_l, gemm_m = A.shape[0], reduce_len, B.shape[1]
# C [CO, N * OH * OW]
k = tvm.reduce_axis((0, gemm_l), name='k')
C = tvm.compute((gemm_n, gemm_m), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name='C')
# output
# the last term C[gemm_n-1, gemm_m-1] is for enabling the alignment,
# otherwise the alignment above will be eliminated by bound inference
output = tvm.compute((N, CO, OH, OW), lambda n, co, h, w:\
C[co][n * OH * OW + h * OW + w] + tvm.const(0, C.dtype) * C[gemm_n-1, gemm_m-1],
name='output', tag='im2col_conv_output')
return output
def _schedule_im2col_conv2d(s, op):
"""schedule the Im2Col method for conv2d"""
# get ops and tensors
output = op.output(0)
C = op.input_tensors[0]
A, B = C.op.input_tensors
kernel = A.op.input_tensors[0]
data = B.op.input_tensors[0]
# tuning parameter config
tune_config = getattr(tvm.target.current_target(), "tune_config", None)
if tune_config is None: # use rule
bn = 4
unroll_step = 16
total_work = util.get_const_int(C.shape[0] * C.shape[1])
reduce_work = util.get_const_int(A.shape[1])
if total_work > 200000:
last_work = util.get_const_int(C.shape[1])
if last_work > 10000:
num_thread = 16
elif last_work > 3000:
num_thread = 8
elif reduce_work > 100:
num_thread = 4
else:
num_thread = 2
if reduce_work < 50 and last_work < 30000:
num_thread = 4
elif total_work > 150000:
num_thread = 8
elif total_work > 50000:
num_thread = 4
else:
num_thread = 2
if num_thread == 4:
unroll_step = 2
else:
bn = tune_config["bn"]
num_thread = tune_config["num_thread"]
unroll_step = tune_config["unroll_step"]
bna = bnb = bn
num_thread1 = num_thread2 = num_thread
if data.dtype == 'float16':
bnb *= 2
last_work = util.get_const_int(C.shape[1])
if last_work % (bnb * num_thread2) != 0:
num_thread1 = num_thread * 2
num_thread2 = num_thread // 2
# schedule dilation
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
s[kernel].compute_inline()
return s
# schedule padding
if isinstance(data.op, tvm.tensor.ComputeOp) and "pad" in data.op.tag:
data_pad = data
s[data_pad].compute_inline()
##### WINOGRAD TEMPLATE #####
def _pick_tile_size(data, kernel):
N, CI, H, W = get_const_tuple(data.shape)
##### SCHEDULE A #####
if util.get_const_int(kernel.shape[2]) == 1 and util.get_const_int(kernel.shape[3]) == 1:
s[A].compute_inline()
if H % 4 == 0:
return 4
else:
y, x = s[A].op.axis
yo, xo, yi, xi = s[A].tile(y, x, bna, util.get_const_int(kernel.shape[3]))
s[A].vectorize(xi)
fuse_and_bind(s, A, [yo, xo])
# pack to vector form
packedA = pack_tensor(s, A, bna, [C])
# vectorize load
y, x = s[packedA].op.axis[:2]
tmp = s.cache_write(packedA, "local")
x, xt = s[packedA].split(x, bna)
_, _, _, xi = tile_and_bind(s, packedA, y, x, num_thread)
s[tmp].compute_at(s[packedA], xi)
s[tmp].vectorize(s[tmp].op.axis[1])
s[tmp].unroll(s[tmp].op.axis[2])
s[packedA].vectorize(s[packedA].op.axis[2])
s[packedA].unroll(xt)
##### SCHEDULE B #####
y, x = s[B].op.axis
yo, xo, yi, xi = s[B].tile(y, x, 1, 1 * bnb)
fuse_and_bind(s, B, [yo, xo])
# transpose and pack to vector form
B_transpose, B_tmp = transpose(s, B, [C])
s[B_transpose].compute_inline()
packedB = pack_tensor(s, B_transpose, bnb, [B_tmp])
# vectorize load
s[packedB].vectorize(s[packedB].op.axis[2])
y, x = s[packedB].op.axis[:2]
tile_and_bind(s, packedB, y, x, num_thread)
##### SCHEDULE C #####
# vectorize and unroll dot
y, x = s[C].op.axis
y, x, yt, xt = s[C].tile(y, x, bna, bnb)
k = s[C].op.reduce_axis[0]
s[C].reorder(k, yt, xt)
if unroll_step != 1:
k, k_unroll = s[C].split(k, unroll_step)
s[C].unroll(k_unroll)
s[C].unroll(yt)
s[C].vectorize(xt)
tile_and_bind(s, C, y, x, num_thread1, num_thread2)
##### COPY TO OUTPUT #####
if output.op in s.outputs: # no bias
output = output
else: # has bias
s[output].compute_inline()
output = s.outputs[0]
n, co, h, w = s[output].op.axis
h, w, vh, vw = s[output].tile(h, w, 1, bnb)
s[output].unroll(vh)
if util.get_const_int(s[output].op.output(0).shape[3]) % bnb != 0:
pass
return 2
@conv2d_mali.register('winograd')
def decl_winograd(cfg, data, kernel, strides, padding, layout, out_dtype):
tile_size = _pick_tile_size(data, kernel)
return _decl_winograd(cfg, data, kernel, strides, padding, layout, out_dtype, tile_size)
def _decl_winograd(cfg, data, kernel, strides, padding, layout, out_dtype, tile_size):
N, CI, IH, IW = get_const_tuple(data.shape)
if len(kernel.shape) == 4:
pre_computed = False
CO, _, KH, KW = get_const_tuple(kernel.shape)
else:
s[output].vectorize(vw)
fuse_and_bind(s, output, [n, co, h, w])
def _decl_winograd(data, kernel, stride, padding, layout, out_dtype):
"""declare winograd fast convolution F(2x2, 3x3) for conv2d"""
N, CI, H, W = [util.get_const_int(x) for x in data.shape]
CO, CI, KH, KW = [util.get_const_int(x) for x in kernel.shape]
pre_computed = True
H_CAT, W_CAT, CO, CI, VC = get_const_tuple(kernel.shape)
CO *= VC
KH, KW = H_CAT - tile_size + 1, W_CAT - tile_size + 1
HSTR, WSTR = strides if isinstance(strides, (tuple, list)) else (strides, strides)
HPAD, WPAD, _, _ = get_pad_tuple(padding, kernel)
if isinstance(stride, (tuple, list)):
HSTR, WSTR = stride
else:
HSTR, WSTR = stride, stride
assert HSTR == 1 and WSTR == 1 and HPAD == 1 and WPAD == 1 and KH == 3 and KW == 3
assert layout == 'NCHW'
assert KH == 3 and KW == 3 and HPAD == 1 and WPAD == 1 and HSTR == 1 and WSTR == 1
data_pad = pad(data, (0, 0, HPAD, WPAD), name="data_pad")
B_data = np.array([
[1, 0, 0, 0],
[0, 1, -1, 1],
[-1, 1, 1, 0],
[0, 0, 0, -1]
], out_dtype)
G_data = np.array([
[1, 0, 0],
[1.0/2, 1.0/2, 1.0/2],
[1.0/2, -1.0/2, 1.0/2],
[0, 0, 1],
], out_dtype)
A_data = np.array([
[1, 0],
[1, 1],
[1, -1],
[0, -1],
], out_dtype)
m = 2
if tile_size == 4:
G_data = np.array([
[1 / 4.0, 0, 0],
[-1 / 6.0, -1 / 6.0, -1 / 6.0],
[-1 / 6.0, 1 / 6.0, -1 / 6.0],
[1 / 24.0, 1 / 12.0, 1 / 6.0],
[1 / 24.0, -1 / 12.0, 1 / 6.0],
[0, 0, 1]], dtype=np.float32)
B_data = np.array([
[4, 0, 0, 0, 0, 0],
[0, -4, 4, -2, 2, 4],
[-5, -4, -4, -1, -1, 0],
[0, 1, -1, 2, -2, -5],
[1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 1]], out_dtype)
A_data = np.array([
[1, 0, 0, 0],
[1, 1, 1, 1],
[1, -1, 1, -1],
[1, 2, 4, 8],
[1, -2, 4, -8],
[0, 0, 0, 1]], out_dtype)
elif tile_size == 2:
G_data = np.array([
[1, 0, 0],
[1.0/2, 1.0/2, 1.0/2],
[1.0/2, -1.0/2, 1.0/2],
[0, 0, 1]], np.float32)
B_data = np.array([
[1, 0, 0, 0],
[0, 1, -1, 1],
[-1, 1, 1, 0],
[0, 0, 0, -1]], out_dtype)
A_data = np.array([
[1, 0],
[1, 1],
[1, -1],
[0, -1]], out_dtype)
else:
raise ValueError("Unsupported tile size for winograd: " + str(tile_size))
m = A_data.shape[1]
r = 3
alpha = m + r - 1
K = CO
C = CI
H = (IH + 2 * HPAD - 3) // HSTR + 1
W = (IW + 2 * WPAD - 3) // WSTR + 1
nH, nW = (H + m-1) // m, (W + m-1) // m
P = N * nH * nW
bna, bnb = 4, 4
if data.dtype == 'float16':
bnb *= 2
##### space definition begin #####
tile_bna_candidates = [1, 2, 4, 8, 16]
factors = get_factors(CO)
cfg.define_knob('tile_bna', [x for x in tile_bna_candidates if x in factors])
cfg.define_knob('tile_bnb', [1, 2, 4, 8, 16])
cfg.define_split('tile_t1', CI, num_outputs=2, max_factor=128)
cfg.define_split('tile_t2', CO, num_outputs=2, max_factor=128)
cfg.define_split('c_unroll', CI, num_outputs=2, max_factor=8)
cfg.define_knob('yt', [1, 2, 4, 8, 16, 32])
##### space definition end #####
if cfg.is_fallback:
cfg['tile_bnb'].val = 4
cfg['tile_bna'].val = 4
while CO % cfg['tile_bna'].val != 0:
cfg['tile_bna'].val //= 2
cfg['yt'].val = 8
cfg.fallback_split('tile_t1', [-1, 128])
cfg.fallback_split('tile_t2', [-1, 128])
cfg.fallback_split('c_unroll', [-1, 8])
bna = cfg['tile_bna'].val
bnb = cfg['tile_bnb'].val
P_round = (P + bnb - 1) // bnb * bnb
assert K % bna == 0 and P_round % bnb == 0
assert CO % bna == 0 and P_round % bnb == 0
# pack input tile
input_tile = tvm.compute((C, P_round // bnb, alpha, alpha, bnb),
lambda c, b, eps, nu, bb:
tvm.select(b * bnb + bb < P,\
data_pad[(b*bnb+bb) // (nH*nW)][c][(b*bnb+bb) // nW % nH * m + eps]\
[(b*bnb+bb) % nW * m + nu], tvm.const(0, data_pad.dtype)),
name='d')
input_tile = tvm.compute((CI, P_round // bnb, alpha, alpha, bnb), lambda ci, b, eps, nu, bb: \
tvm.select(b * bnb + bb < P,
data_pad[(b*bnb+bb) // (nH*nW)][ci][(b*bnb+bb) // nW % nH * m + eps]
[(b*bnb+bb) % nW * m + nu], tvm.const(0, data_pad.dtype)), name='d')
# transform kernel
G = const_array(G_data, 'G')
r_kh = tvm.reduce_axis((0, KH), 'r_kh')
r_kw = tvm.reduce_axis((0, KW), 'r_kw')
U = tvm.compute((alpha, alpha, K // bna, C, bna), lambda eps, nu, k, c, kk:
tvm.sum(kernel[k * bna + kk][c][r_kh][r_kw] * G[eps][r_kh] * G[nu][r_kw],
axis=[r_kh, r_kw]), name='U')
if pre_computed:
U = kernel
else:
G = const_matrix(G_data, 'G')
r_kh = tvm.reduce_axis((0, KH), 'r_kh')
r_kw = tvm.reduce_axis((0, KW), 'r_kw')
U = tvm.compute((alpha, alpha, CO // bna, CI, bna), lambda eps, nu, co, ci, vco:
tvm.sum(kernel[co * bna + vco][ci][r_kh][r_kw] * G[eps][r_kh] * G[nu][r_kw],
axis=[r_kh, r_kw]), name='U')
# transform image
B = const_array(B_data, 'B')
r_eps = tvm.reduce_axis((0, alpha), 'r_eps')
r_nu = tvm.reduce_axis((0, alpha), 'r_nu')
V = tvm.compute((alpha, alpha, P_round // bnb, C, bnb), lambda eps, nu, b, c, bb:
tvm.sum(input_tile[c][b][r_eps][r_nu][bb] * B[r_eps][eps] * B[r_nu][nu],
axis=[r_eps, r_nu]), name='V')
B = const_matrix(B_data, 'B')
r_a = tvm.reduce_axis((0, alpha), 'r_a')
r_b = tvm.reduce_axis((0, alpha), 'r_b')
V = tvm.compute((alpha, alpha, P_round // bnb, CI, bnb), lambda eps, nu, p, ci, vp:
tvm.sum(input_tile[ci][p][r_a][r_b][vp] * B[r_a][eps] * B[r_b][nu],
axis=[r_a, r_b]), name='V')
# batch gemm
c = tvm.reduce_axis((0, C), name='c')
M = tvm.compute((alpha, alpha, K, P_round), lambda eps, nu, k, b:
tvm.sum(U[eps][nu][k // bna][c][k % bna] *
V[eps][nu][b // bnb][c][b % bnb], axis=c), name='M')
# inverse transform
A = const_array(A_data, 'A')
r_eps = tvm.reduce_axis((0, alpha), 'r_eps')
r_nu = tvm.reduce_axis((0, alpha), 'r_nu')
Y = tvm.compute((K, P, m, m), lambda k, b, vh, vw:
tvm.sum(M[r_eps][r_nu][k][b] * A[r_eps][vh] * A[r_nu][vw],
axis=[r_eps, r_nu]), name='Y')
ci = tvm.reduce_axis((0, CI), name='c')
M = tvm.compute((alpha, alpha, CO, P_round), lambda eps, nu, co, p:
tvm.sum(U[eps][nu][co // bna][ci][co % bna] *
V[eps][nu][p // bnb][ci][p % bnb], axis=ci), name='M')
A = const_matrix(A_data, 'A')
r_a = tvm.reduce_axis((0, alpha), 'r_a')
r_b = tvm.reduce_axis((0, alpha), 'r_b')
Y = tvm.compute((CO, P, m, m), lambda co, p, vh, vw:
tvm.sum(M[r_a][r_b][co][p] * A[r_a][vh] * A[r_b][vw],
axis=[r_a, r_b]), name='Y')
# unpack output
output = tvm.compute((N, K, H, W), lambda n, k, h, w:
Y[k][n * nH * nW + (h//m) * nW + w//m][h % m][w % m]
output = tvm.compute((N, CO, H, W), lambda n, co, h, w:
Y[co][n * nH * nW + (h//m) * nW + w//m][h % m][w % m]
# the following term is used to make the padding effective,
# otherwise the padding will be eliminated by bound inference
+ tvm.const(0, out_dtype) * M[alpha-1][alpha-1][K-1][P_round-1],
name='output', tag='winograd_conv_output')
+ tvm.const(0, out_dtype) * M[alpha-1][alpha-1][CO-1][P_round-1],
name='output', tag='winograd_conv2d_output',
attrs={'workload': _winograd_conv_arg_to_workload(
data, kernel, strides, padding, layout, out_dtype, tile_size)})
# we have to manually assign effective GFLOP for winograd
cfg.add_flop(2 * N * CO * H * W * KH * KW * CI)
return output
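# A minimal, standalone numpy sketch (illustration only, never called by the schedule):
# with the tile_size == 2 matrices above, Y = A^T ((G g G^T) * (B^T d B)) A equals the
# 2x2 valid cross-correlation of a 4x4 input tile d with a 3x3 filter g, which is what
# U, V, M and Y compute per channel/tile. The helper name below is hypothetical and it
# assumes numpy is available as np, as it already is in this module.
def _winograd_f2x2_3x3_sketch():
    B = np.array([[1, 0, 0, 0],
                  [0, 1, -1, 1],
                  [-1, 1, 1, 0],
                  [0, 0, 0, -1]], dtype=np.float32)
    G = np.array([[1, 0, 0],
                  [0.5, 0.5, 0.5],
                  [0.5, -0.5, 0.5],
                  [0, 0, 1]], dtype=np.float32)
    A = np.array([[1, 0],
                  [1, 1],
                  [1, -1],
                  [0, -1]], dtype=np.float32)
    d = np.random.randn(4, 4).astype(np.float32)      # one 4x4 input tile
    g = np.random.randn(3, 3).astype(np.float32)      # one 3x3 filter
    U = G.dot(g).dot(G.T)                             # kernel transform (cf. U)
    V = B.T.dot(d).dot(B)                             # image transform (cf. V)
    Y = A.T.dot(U * V).dot(A)                         # inverse transform (cf. Y)
    ref = np.array([[np.sum(d[i:i + 3, j:j + 3] * g) for j in range(2)]
                    for i in range(2)])               # direct 2x2 correlation
    np.testing.assert_allclose(Y, ref, rtol=1e-4)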
def _schedule_winograd(s, op):
def _schedule_winograd(cfg, s, op):
"""schedule winograd fast convolution F(2x2, 3x3) for conv2d"""
# get ops and tensors
output = op.output(0)
Y = op.input_tensors[0]
M, A = s[Y].op.input_tensors
U, V = s[M].op.input_tensors
kernel, G = s[U].op.input_tensors
d, B = s[V].op.input_tensors
data_pad = s[d].op.input_tensors[0]
data = s[data_pad].op.input_tensors[0]
# dilation
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
s[kernel].compute_inline()
# padding
s[data_pad].compute_inline()
# pack input tiles
c, b, eps, nu, bb = s[d].op.axis
s[d].reorder(eps, nu, bb)
aha = s[d].fuse(eps, nu)
s[d].unroll(bb)
tile_and_bind3d(s, d, c, b, aha, 4, 1, 1)
# transform kernel
s[G].compute_inline()
eps, nu, k, c, kk, = s[U].op.axis
r_kh, r_kw = s[U].op.reduce_axis
s[U].reorder(k, c, kk, eps, nu, r_kh, r_kw)
_ = [s[U].unroll(x) for x in [eps, nu, r_kh, r_kw]]
s[U].vectorize(kk)
tile_and_bind(s, U, k, c, 1, 256)
if isinstance(U.op, tvm.tensor.ComputeOp):
kernel, G = s[U].op.input_tensors
s[G].compute_inline()
eps, nu, co, ci, vco, = s[U].op.axis
if autotvm.GLOBAL_SCOPE.in_tuning:
# kernel transformation will be pre-computed during compilation, so we skip
# this part to make tuning records correct
s[U].pragma(eps, 'debug_skip_region')
else:
r_kh, r_kw = s[U].op.reduce_axis
s[U].reorder(co, ci, eps, nu, r_kh, r_kw, vco)
_ = [s[U].unroll(x) for x in [eps, nu, r_kh, r_kw]]
s[U].vectorize(vco)
tile_and_bind(s, U, co, ci, 1, 256)
# dilation
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
s[kernel].compute_inline()
# transform image
s[B].compute_inline()
eps, nu, b, c, bb = s[V].op.axis
r_eps, r_nu = s[V].op.reduce_axis
s[V].reorder(b, c, bb, eps, nu, r_nu, r_eps)
_ = [s[V].unroll(x) for x in [eps, nu, r_eps, r_nu]]
s[V].vectorize(bb)
tile_and_bind(s, V, b, c, 2, 1)
VL = s.cache_write(V, 'local')
eps, nu, p, ci, vp = s[V].op.axis
s[V].reorder(p, ci, eps, nu, vp)
for axis in [eps, nu]:
s[V].unroll(axis)
s[V].vectorize(vp)
fused = s[V].fuse(p, ci)
bb, tt = cfg['tile_t1'].apply(s, V, fused)
s[V].bind(bb, tvm.thread_axis('blockIdx.x'))
s[V].bind(tt, tvm.thread_axis('threadIdx.x'))
eps, nu, p, ci, vp = s[VL].op.axis
r_a, r_b = s[VL].op.reduce_axis
for axis in [eps, nu, r_a, r_b]:
s[VL].unroll(axis)
s[VL].vectorize(vp)
s[d].compute_at(s[V], tt)
s[VL].compute_at(s[V], tt)
# batch gemm
bna, bnb = 4, 4
if data.dtype == 'float16':
bnb *= 2
bna = cfg['tile_bna'].val
bnb = cfg['tile_bnb'].val
eps, nu, k, b = s[M].op.axis
alpha = eps.dom.extent
c = s[M].op.reduce_axis[0]
yo, xo, yi, xi = s[M].tile(k, b, bna, bnb)
s[M].reorder(c, yi, xi)
c, c_unroll = s[M].split(c, 2)
c, c_unroll = cfg['c_unroll'].apply(s, M, c)
s[M].reorder(yo, xo, c, c_unroll, yi, xi)
s[M].unroll(c_unroll)
s[M].unroll(yi)
s[M].vectorize(xi)
z = s[M].fuse(eps, nu)
tile_and_bind3d(s, M, z, yo, xo, 1, 8, 1)
tile_and_bind3d(s, M, z, yo, xo, 1, cfg['yt'].val, 1)
# inverse transform
s[A].compute_inline()
k, b, vh, vw = s[Y].op.axis
r_eps, r_nu = s[Y].op.reduce_axis
_ = [s[Y].unroll(x) for x in [vh, vw, r_eps, r_nu]]
tile_and_bind(s, Y, k, b, 4, 1)
r_a, r_b = s[Y].op.reduce_axis
for axis in [vh, vw, r_a, r_b]:
s[Y].unroll(axis)
# schedule output
if output.op in s.outputs: # no bias
output = output
else: # has bias
# schedule output and fusion
if output.op not in s.outputs:
s[output].compute_inline()
output = s.outputs[0]
_, k, h, w = s[output].op.axis
tile_and_bind3d(s, output, k, h, w, 1, 2, 2)
n, co, h, w = s[output].op.axis
m = alpha - 3 + 1
h, w, hi, wi = s[output].tile(h, w, m, m)
s[output].unroll(hi)
s[output].unroll(wi)
fused = s[output].fuse(n, co, h, w)
bb, tt = cfg['tile_t2'].apply(s, output, fused)
s[output].bind(bb, tvm.thread_axis('blockIdx.x'))
s[output].bind(tt, tvm.thread_axis('threadIdx.x'))
s[Y].compute_at(s[output], tt)
##### REGISTER TOPI COMPUTE / SCHEDULE FOR WINOGRAD WITH WEIGHT TRANSFORM #####
@conv2d_winograd_without_weight_transform.register(['mali'])
@autotvm.task.dispatcher
def winograd_ww_config_dispatcher_(data, kernel, strides, padding, layout, out_dtype, tile_size):
return _winograd_conv_arg_to_workload(data, kernel, strides, padding, layout, out_dtype,
tile_size)
@winograd_ww_config_dispatcher_.register(['winograd'])
def decl_winograd_ww(cfg, data, kernel, strides, padding, layout, out_dtype, tile_size):
return _decl_winograd(cfg, data, kernel, strides, padding, layout, out_dtype,
tile_size)
@autotvm.task.register_topi_schedule(schedule_conv2d_winograd_without_weight_transform,
'mali', ['winograd'])
def schedule_conv2d_winograd_without_weight_transform_(cfg, outs):
"""TOPI schedule callback"""
s = tvm.create_schedule([x.op for x in outs])
def _callback(op):
if 'winograd_conv2d_output' in op.tag:
_schedule_winograd(cfg, s, op)
traverse_inline(s, outs[0].op, _callback)
return s
##### SCHEDULE UTILITIES #####
def tile_and_bind(s, tensor, y, x, y_factor, x_factor=None):
""" tile and bind to GPU threads """
x_factor = x_factor or y_factor
yo, xo, yi, xi = s[tensor].tile(y, x, y_factor, x_factor)
s[tensor].bind(xo, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(xi, tvm.thread_axis("threadIdx.x"))
s[tensor].bind(yo, tvm.thread_axis("blockIdx.y"))
s[tensor].bind(yi, tvm.thread_axis("threadIdx.y"))
return yo, xo, yi, xi
def tile_and_bind3d(s, tensor, z, y, x, z_factor=2, y_factor=None, x_factor=None):
""" tile and bind 3d """
y_factor = y_factor or z_factor
x_factor = x_factor or y_factor
zo, zi = s[tensor].split(z, z_factor)
yo, yi = s[tensor].split(y, y_factor)
xo, xi = s[tensor].split(x, x_factor)
s[tensor].bind(zo, tvm.thread_axis("blockIdx.z"))
s[tensor].bind(zi, tvm.thread_axis("threadIdx.z"))
s[tensor].bind(yo, tvm.thread_axis("blockIdx.y"))
s[tensor].bind(yi, tvm.thread_axis("threadIdx.y"))
s[tensor].bind(xo, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(xi, tvm.thread_axis("threadIdx.x"))
s[tensor].reorder(zo, yo, xo, zi, yi, xi)
return zo, yo, xo, zi, yi, xi
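# A hypothetical usage sketch for the two helpers above (old tvm 0.x API; the tensor
# names are made up for illustration):
#
#     A = tvm.placeholder((256, 256), name='A')
#     B = tvm.compute((256, 256), lambda i, j: A[i, j] * 2, name='B')
#     s = tvm.create_schedule(B.op)
#     y, x = s[B].op.axis
#     tile_and_bind(s, B, y, x, y_factor=8, x_factor=32)
#
# This splits the 256x256 iteration space into a 32 x 8 grid of blocks
# (blockIdx.y x blockIdx.x), each with 8 x 32 threads (threadIdx.y x threadIdx.x);
# tile_and_bind3d does the same for a third axis using the z dimension of the grid.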
......@@ -4,17 +4,21 @@
from __future__ import absolute_import as _abs
import tvm
from tvm import autotvm
from .. import generic
from .. import util
from .. import tag
from .. import generic, nn
from ..util import traverse_inline
@generic.schedule_dense.register(["mali"])
def schedule_dense(outs):
autotvm.register_topi_compute(nn.dense, 'mali', 'direct', nn.dense.fdefault)
@autotvm.register_topi_schedule(generic.schedule_dense, 'mali', 'direct')
def schedule_dense(cfg, outs):
"""Schedule for dense operator.
Parameters
----------
cfg: ConfigEntity
The config entity for this template
outs: Array of Tensor
The computation graph description of dense
in the format of an array of tensors.
......@@ -26,80 +30,65 @@ def schedule_dense(outs):
"""
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
s = tvm.create_schedule([x.op for x in outs])
def _schedule(dense):
data = s[dense].op.input_tensors[0]
weight = s[dense].op.input_tensors[1]
hidden = util.get_const_int(weight.shape[1])
out = util.get_const_int(weight.shape[0])
# set tunable parameter
tune_config = getattr(tvm.target.current_target(), "tune_config", None)
if tune_config is None:
if hidden > 8192:
num_thread = 32
unroll_step = 32
else:
if out <= 1024:
num_thread = 32
unroll_step = 16
else:
num_thread = 256
unroll_step = 32
if data.dtype == 'float16':
if hidden > 8192:
num_thread = 2
unroll_step = 32
else:
num_thread = 8
unroll_step = 256
else:
num_thread = tune_config['num_thread']
unroll_step = tune_config['unroll_step']
def fuse_and_bind(s, tensor, axis=None, num_thread=None):
""" fuse all the axis and bind to GPU threads """
axis = axis or s[tensor].op.axis
fused = s[tensor].fuse(*axis)
max_threads = tvm.target.current_target(allow_none=False).max_num_threads
bx, tx = s[tensor].split(fused, num_thread or max_threads)
s[tensor].bind(bx, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(tx, tvm.thread_axis("threadIdx.x"))
return bx, tx
output = outs[0]
bx, tx = fuse_and_bind(s, output, num_thread=num_thread)
k = s[dense].op.reduce_axis[0]
k, k_unroll = s[dense].split(k, unroll_step)
s[dense].unroll(k_unroll)
if dense.op not in s.outputs:
def _callback(op):
if op.tag == 'dense':
vec_size = [1, 2, 4, 8, 16]
max_unroll = 32
dense = op.output(0)
output = outs[0]
y, x = s[output].op.axis
c = s[dense].op.reduce_axis[0]
##### space definition begin #####
cfg.define_split('tile_y', y, num_outputs=3)
cfg.define_split('tile_x', x, num_outputs=3)
cfg.define_split('c_unroll', c, num_outputs=2, max_factor=64)
# fallback support
if cfg.is_fallback:
ref_log = autotvm.tophub.load_reference_log(
'mali', 'rk3399', 'dense', 'direct')
cfg.fallback_with_reference_log(ref_log)
##### space definition end #####
if dense.op in s.outputs:
dense = s.cache_write(output, 'local')
by, ty, yi = cfg['tile_y'].apply(s, output, y)
bx, tx, xi = cfg['tile_x'].apply(s, output, x)
s[output].bind(by, tvm.thread_axis('blockIdx.y'))
s[output].bind(bx, tvm.thread_axis('blockIdx.x'))
s[output].bind(ty, tvm.thread_axis('threadIdx.y'))
s[output].bind(tx, tvm.thread_axis('threadIdx.x'))
if cfg['tile_y'].size[-1] < max_unroll:
s[output].unroll(yi)
if cfg['tile_x'].size[-1] in vec_size:
s[output].vectorize(xi)
s[dense].compute_at(s[output], tx)
# bias = s[outs[0]].op.input_tensors[1]
# print(tvm.lower(s, [data, weight, bias, outs[0]], simple_mode=True))
scheduled_ops = []
def traverse(OP):
"""Internal travserse function"""
# inline all one-to-one-mapping operators except the last stage (output)
if tag.is_broadcast(OP.tag):
if OP not in s.outputs:
s[OP].compute_inline()
for tensor in OP.input_tensors:
if tensor.op.input_tensors and tensor.op not in scheduled_ops:
traverse(tensor.op)
# schedule dense
elif OP.tag == 'dense':
dense = OP.output(0)
_schedule(dense)
else:
raise RuntimeError("Unsupported operator: %s" % OP.tag)
scheduled_ops.append(OP)
traverse(outs[0].op)
k = s[dense].op.reduce_axis[0]
y, x = s[dense].op.axis
k, k_unroll = cfg['c_unroll'].apply(s, dense, k)
s[dense].reorder(k, k_unroll, y, x)
s[dense].unroll(k_unroll)
if cfg['tile_y'].size[-1] < max_unroll:
s[dense].unroll(y)
if cfg['tile_x'].size[-1] in vec_size:
s[dense].vectorize(x)
traverse_inline(s, outs[0].op, _callback)
return s
def fuse_and_bind(s, tensor, axis=None, num_thread=None):
""" fuse all the axis and bind to GPU threads """
axis = axis or s[tensor].op.axis
fused = s[tensor].fuse(*axis)
bx, tx = s[tensor].split(fused, num_thread)
s[tensor].bind(bx, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(tx, tvm.thread_axis("threadIdx.x"))
return bx, tx
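# e.g. for a hypothetical (64, 1000) output tensor C, fuse_and_bind(s, C, num_thread=64)
# fuses the two axes into a single 64000-iteration loop and splits it into 1000 blocks
# of 64 threads (blockIdx.x / threadIdx.x).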
# pylint: disable=invalid-name,unused-variable,unused-argument
"""depthwise_conv2d schedule on ARM Mali GPU"""
from __future__ import absolute_import as _abs
import tvm
from tvm import autotvm
from .. import generic
from .. import util
from .. import tag
from ..generic import schedule_depthwise_conv2d_nchw
from ..nn import depthwise_conv2d_nchw
from ..util import traverse_inline
@generic.schedule_depthwise_conv2d_nchw.register(["mali"])
def schedule_depthwise_conv2d_nchw(outs):
"""Schedule for depthwise_conv2d nchw forward.
# register original implementation of depthwise_conv2d_nchw since we don't need to change this part
autotvm.register_topi_compute(depthwise_conv2d_nchw, 'mali', 'direct',
depthwise_conv2d_nchw.fdefault)
# register customized schedule for Mali GPU.
@autotvm.register_topi_schedule(schedule_depthwise_conv2d_nchw, 'mali', 'direct')
def schedule_depthwise_conv2d_nchw_mali(cfg, outs):
"""Schedule depthwise conv2d
Parameters
----------
cfg: ConfigEntity
The configuration of this template
outs: Array of Tensor
The computation graph description of depthwise_conv2d
The computation graph description of depthwise conv2d
in the format of an array of tensors.
Returns
......@@ -25,89 +32,95 @@ def schedule_depthwise_conv2d_nchw(outs):
"""
outs = [outs] if isinstance(outs, tvm.tensor.Tensor) else outs
s = tvm.create_schedule([x.op for x in outs])
def _schedule(pad_data, kernel, conv):
raw_data = s[pad_data].op.input_tensors[0]
if conv.op not in s.outputs: # has bias or relu
output = outs[0]
else: # no bias or relu
output = conv
def _schedule(pad_data, kernel, conv):
"""schedule depthwise_conv2d"""
max_unroll = 16
vec_size = [1, 2, 4, 8, 16]
def tile_and_bind3d(tensor, z, y, x, z_factor=2, y_factor=None, x_factor=None):
""" tile and bind 3d """
y_factor = y_factor or z_factor
x_factor = x_factor or y_factor
zo, zi = s[tensor].split(z, z_factor)
yo, yi = s[tensor].split(y, y_factor)
xo, xi = s[tensor].split(x, x_factor)
s[tensor].bind(zo, tvm.thread_axis("blockIdx.z"))
s[tensor].bind(zi, tvm.thread_axis("threadIdx.z"))
s[tensor].bind(yo, tvm.thread_axis("blockIdx.y"))
s[tensor].bind(yi, tvm.thread_axis("threadIdx.y"))
s[tensor].bind(xo, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(xi, tvm.thread_axis("threadIdx.x"))
return zo, zi, yo, yi, xo, xi
# set tunable parameters
VH = 1
VW = 1
num_thread = 4
while util.get_const_int(conv.shape[3]) % (VW * 2) == 0 and VW * 2 <= 4:
VW = VW * 2
while util.get_const_int(conv.shape[2]) % (VH * 2) == 0 and VH * 2 <= 2:
VH = VH * 2
if raw_data.dtype == 'float16':
if util.get_const_int(conv.shape[3]) % (VW * 2) == 0:
VW *= 2
num_thread *= 2
else:
num_thread *= 2
##### space definition begin #####
n, c, y, x = s[conv].op.axis
bc, tc, ci = cfg.define_split("tile_c", c, num_outputs=3)
by, ty, yi = cfg.define_split('tile_y', y, num_outputs=3)
bx, tx, xi = cfg.define_split("tile_x", x, num_outputs=3)
cfg.define_annotate('ann_spatial', [ci, yi, xi], policy='try_unroll_vec')
# schedule padding
_, c, y, x = s[pad_data].op.axis
tile_and_bind3d(pad_data, c, y, x, num_thread, 1, 1)
# fallback support
if cfg.is_fallback:
ref_log = autotvm.tophub.load_reference_log(
'mali', 'rk3399', 'depthwise_conv2d_nchw', 'direct')
cfg.fallback_with_reference_log(ref_log)
###### space definition end ######
# schedule conv
di, dj = s[conv].op.reduce_axis
s[conv].unroll(di)
s[conv].unroll(dj)
_, c, y, x = s[output].op.axis
y, x, yi, xi = s[output].tile(y, x, VH, VW)
s[output].unroll(yi)
s[output].vectorize(xi)
# schedule padding
n, c, y, x = s[pad_data].op.axis
tile_and_bind3d(s, pad_data, c, y, x, cfg["tile_c"].size[1], 1, 1)
_, _, _, _, _, ji = tile_and_bind3d(output, c, y, x, num_thread, 1, 1)
# schedule dilation
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
s[kernel].compute_inline()
# schedule conv
if conv.op not in s.outputs:
_, c, y, x = s[conv].op.axis
y, x, yi, xi = s[conv].tile(y, x, VH, VW)
s[conv].unroll(yi)
s[conv].vectorize(xi)
s[conv].compute_at(s[output], ji)
scheduled_ops = []
def traverse(op):
"""Internal travserse function"""
# inline all one-to-one-mapping operators except the last stage (output)
if tag.is_broadcast(op.tag):
if op not in s.outputs:
s[op].compute_inline()
for tensor in op.input_tensors:
if tensor.op.input_tensors and tensor.op not in scheduled_ops:
traverse(tensor.op)
s[conv].set_scope('local')
OL = conv
output = s.outputs[0].output(0)
else:
OL = s.cache_write(conv, 'local')
output = conv
n, c, y, x = s[output].op.axis
bc, tc, ci = cfg['tile_c'].apply(s, output, c)
by, ty, yi = cfg['tile_y'].apply(s, output, y)
bx, tx, xi = cfg['tile_x'].apply(s, output, x)
bc = s[output].fuse(n, bc)
s[output].bind(bc, tvm.thread_axis("blockIdx.z"))
s[output].bind(tc, tvm.thread_axis("threadIdx.z"))
s[output].bind(by, tvm.thread_axis("blockIdx.y"))
s[output].bind(ty, tvm.thread_axis("threadIdx.y"))
s[output].bind(bx, tvm.thread_axis("blockIdx.x"))
s[output].bind(tx, tvm.thread_axis("threadIdx.x"))
di, dj = s[OL].op.reduce_axis
s[OL].unroll(di)
s[OL].unroll(dj)
s[OL].compute_at(s[output], tx)
n, ci, yi, xi = s[OL].op.axis
cfg["ann_spatial"].apply(s, OL, [ci, yi, xi],
axis_lens=[cfg['tile_c'].size[2], cfg['tile_y'].size[2],
cfg['tile_x'].size[2]],
max_unroll=max_unroll,
vec_size=vec_size,
cfg=cfg)
def _callback(op):
"""traverse to find op to schedule"""
# schedule depthwise_conv2d
if op.tag == 'depthwise_conv2d_nchw':
pad_data = op.input_tensors[0]
kernel = op.input_tensors[1]
if isinstance(kernel.op, tvm.tensor.ComputeOp) and 'dilate' in kernel.op.tag:
s[kernel].compute_inline()
conv = op.output(0)
_schedule(pad_data, kernel, conv)
scheduled_ops.append(op)
traverse(outs[0].op)
traverse_inline(s, outs[0].op, _callback)
return s
def tile_and_bind3d(s, tensor, z, y, x, z_factor=2, y_factor=None, x_factor=None):
""" tile and bind 3d """
y_factor = y_factor or z_factor
x_factor = x_factor or y_factor
zo, zi = s[tensor].split(z, z_factor)
yo, yi = s[tensor].split(y, y_factor)
xo, xi = s[tensor].split(x, x_factor)
s[tensor].bind(zo, tvm.thread_axis("blockIdx.z"))
s[tensor].bind(zi, tvm.thread_axis("threadIdx.z"))
s[tensor].bind(yo, tvm.thread_axis("blockIdx.y"))
s[tensor].bind(yi, tvm.thread_axis("threadIdx.y"))
s[tensor].bind(xo, tvm.thread_axis("blockIdx.x"))
s[tensor].bind(xi, tvm.thread_axis("threadIdx.x"))
return zo, zi, yo, yi, xo, xi
# pylint: disable=invalid-name, no-member, too-many-locals, too-many-statements, too-many-arguments, too-many-branches, line-too-long
"""Compute and schedule for rocm conv2d_nchw with auto fusion"""
# pylint: disable=invalid-name
"""Compute definition for conv2d with rocm backend"""
import tvm
from tvm import autotvm
from tvm.contrib import miopen
import topi
from .. import generic
from ..nn.conv2d import conv2d
from ..util import get_const_int
from .. import nn, generic
from ..util import get_const_int, get_const_tuple
from ..cuda.conv2d import conv2d_cuda, schedule_conv2d_nchw_cuda
@conv2d.register("rocm")
def conv2d_rocm(data, kernel, stride, padding, layout='NCHW', out_dtype='float32'):
@autotvm.register_topi_compute(nn.conv2d, 'rocm', ['direct', 'winograd'])
def conv2d_rocm(cfg, data, kernel, strides, padding, layout='NCHW', out_dtype='float32'):
"""Conv2D operator for rocm backend.
Parameters
----------
cfg: ConfigEntity
The config for this template
input : tvm.Tensor
4-D with shape [batch, in_channel, in_height, in_width]
filter : tvm.Tensor
4-D with shape [num_filter, in_channel, filter_height, filter_width]
stride : int or a list/tuple of two ints
strides : int or a list/tuple of two ints
stride size, or [stride_height, stride_width]
padding : int or a list/tuple of two ints
......@@ -34,31 +37,42 @@ def conv2d_rocm(data, kernel, stride, padding, layout='NCHW', out_dtype='float32
output : tvm.Tensor
4-D with shape [batch, out_channel, out_height, out_width]
"""
assert layout == 'NCHW', "Only NCHW layout is supported."
assert isinstance(stride, int) or len(stride) == 2
if isinstance(stride, int):
stride_h = stride_w = stride
else:
stride_h, stride_w = stride
if isinstance(padding, int):
pad_h = pad_w = padding
else:
pad_h, pad_w = padding
# handle dilation
dilation_h = dilation_w = 1
kernel_tvm = kernel
kernel_cudnn = kernel
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
kernel_before_dilation = kernel.op.input_tensors[0]
kernel_cudnn = kernel_before_dilation
dilation_h = (get_const_int(kernel.shape[2]) + get_const_int(kernel_before_dilation.shape[2]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
dilation_w = (get_const_int(kernel.shape[3]) + get_const_int(kernel_before_dilation.shape[3]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
target = tvm.target.current_target()
if "miopen" in target.libs:
assert layout == 'NCHW', "Only NCHW layout is supported."
CO, CI, KH, KW = get_const_tuple(kernel.shape)
N, _, H, W = get_const_tuple(data.shape)
# handle dilation
stride_h, stride_w = (strides, strides) if isinstance(strides, int) else strides
pad_h, pad_w = (padding, padding) if isinstance(padding, int) else padding
OH = (H + 2 * pad_h - KH) // stride_h + 1
OW = (W + 2 * pad_w - KW) // stride_w + 1
cfg.add_flop(2 * N * OH * OW * CO * CI * KH * KW)
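# e.g. for a ResNet-style first layer with H = W = 224, KH = KW = 7, stride 2 and
# pad 3: OH = OW = (224 + 2*3 - 7) // 2 + 1 = 112, and the value registered above
# counts each multiply-add of the OH x OW x CO x CI x KH x KW loop nest as 2 FLOPs.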
dilation_h = dilation_w = 1
kernel_before_dilation = kernel
if isinstance(kernel.op, tvm.tensor.ComputeOp) and "dilate" in kernel.op.tag:
kernel_before_dilation = kernel.op.input_tensors[0]
if layout == 'NCHW':
dilation_h = (get_const_int(kernel.shape[2]) +
get_const_int(kernel_before_dilation.shape[2]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
dilation_w = (get_const_int(kernel.shape[3]) +
get_const_int(kernel_before_dilation.shape[3]) - 1) \
// get_const_int(kernel_before_dilation.shape[3])
elif layout == 'NHWC':
dilation_h = (get_const_int(kernel.shape[1]) +
get_const_int(kernel_before_dilation.shape[1]) - 1) \
// get_const_int(kernel_before_dilation.shape[1])
dilation_w = (get_const_int(kernel.shape[2]) +
get_const_int(kernel_before_dilation.shape[2]) - 1) \
// get_const_int(kernel_before_dilation.shape[2])
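# e.g. a 3x3 kernel dilated by 2 reaches this point with shape 5x5, and
# (5 + 3 - 1) // 3 == 2 recovers the dilation factor to pass to MIOpen, while the
# undilated kernel_before_dilation is what is actually handed to the library call.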
return miopen.conv2d_forward(data,
kernel_cudnn,
kernel_before_dilation,
stride_h,
stride_w,
pad_h,
......@@ -66,25 +80,30 @@ def conv2d_rocm(data, kernel, stride, padding, layout='NCHW', out_dtype='float32
dilation_h,
dilation_w,
conv_mode=0)
return topi.nn.conv2d_nchw(data, kernel_tvm, stride, padding, out_dtype)
return conv2d_cuda(cfg, data, kernel, strides, padding, layout, out_dtype)
@generic.schedule_conv2d_nchw.register(["rocm"])
def schedule_conv2d_nchw(outs):
"""Schedule for conv2d_nchw with rocm backend.
@autotvm.register_topi_schedule(generic.schedule_conv2d_nchw, 'rocm', ["direct", 'winograd'])
def schedule_conv2d_nchw_rocm(cfg, outs):
"""TOPI schedule callback of conv2d for rocm
Parameters
----------
cfg: ConfigEntity
The config for this template
outs: Array of Tensor
The computation graph description of conv2d_nchw
The computation graph description of conv2d
in the format of an array of tensors.
Returns
-------
s: Schedule
The computation schedule for conv2d_nchw.
The computation schedule for conv2d.
"""
target = tvm.target.current_target()
if target and "miopen" in target.libs:
return topi.generic.schedule_extern(outs)
return topi.cuda.schedule_conv2d_nchw(outs)
return generic.schedule_extern(outs)
return schedule_conv2d_nchw_cuda(cfg, outs)
......@@ -9,4 +9,4 @@ def get_all_backend():
A list of all supported targets
"""
return ['llvm', 'cuda', 'opencl', 'metal', 'rocm', 'vulkan', 'nvptx',
'llvm -device=arm_cpu', 'aocl_sw_emu']
'llvm -device=arm_cpu', 'opencl -device=mali', 'aocl_sw_emu']
......@@ -48,7 +48,8 @@ def verify_conv2d_nchw(batch, in_channel, in_size, num_filter, kernel, stride, p
print("Running on target: %s" % device)
with tvm.target.create(device):
dW = topi.nn.dilate(W, (1, 1, dilation, dilation))
C = topi.nn.conv2d(A, dW, stride, padding, layout='NCHW', out_dtype=dtype)
C = topi.nn.conv2d(A, dW, (stride, stride), (padding, padding),
layout='NCHW', out_dtype=dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
......@@ -72,7 +73,11 @@ def verify_conv2d_nchw(batch, in_channel, in_size, num_filter, kernel, stride, p
def test_conv2d_nchw():
autotvm.DispatchContext.current.silent = True
# load tophub
ctx = autotvm.apply_history_best([])
for device in get_all_backend():
context = autotvm.tophub.context(device)
context.__enter__()
# ResNet18 workloads
verify_conv2d_nchw(1, 3, 224, 64, 7, 2, 3)
......@@ -96,9 +101,21 @@ def test_conv2d_nchw():
# dilation = 2
verify_conv2d_nchw(1, 64, 56, 64, 3, 1, 1, dilation=2)
# batch size
verify_conv2d_nchw(4, 64, 56, 64, 3, 1, 1)
verify_conv2d_nchw(9, 64, 56, 64, 3, 1, 1)
# weird workloads
verify_conv2d_nchw(1, 1, 1, 1, 1, 1, 1, dilation=1)
verify_conv2d_nchw(1, 1, 1, 1, 1, 1, 1, dilation=2)
verify_conv2d_nchw(2, 2, 2, 2, 2, 2, 2)
verify_conv2d_nchw(3, 3, 3, 3, 3, 3, 3)
verify_conv2d_nchw(4, 4, 4, 4, 4, 4, 4)
verify_conv2d_nchw(5, 5, 5, 5, 5, 5, 5)
verify_conv2d_nchw(6, 6, 6, 6, 6, 6, 6)
# disable these tests due to some bugs in llvm's nvptx backend
# verify_conv2d_nchw(1, 1, 1, 1, 1, 1, 1, dilation=1)
# verify_conv2d_nchw(1, 1, 1, 1, 1, 1, 1, dilation=2)
# verify_conv2d_nchw(2, 13, 71, 59, 3, 1, 1)
# inception v3 workloads
verify_conv2d_nchw(1, 3, 299, 32, 3, 2, 0)
......@@ -117,22 +134,22 @@ def test_conv2d_nchw():
verify_conv2d_nchw(1, 288, 35, 64, 1, 1, 0)
verify_conv2d_nchw(1, 288, 35, 48, 1, 1, 0)
verify_conv2d_nchw(1, 288, 35, 384, 3, 2, 0)
# verify_conv2d_nchw(1, 96, 35, 96, 3, 2, 0)
# verify_conv2d_nchw(1, 768, 17, 192, 1, 1, 0)
# verify_conv2d_nchw(1, 768, 17, 128, 1, 1, 0)
# verify_conv2d_nchw(1, 128, 17, 128, 1, 1, 0)
# verify_conv2d_nchw(1, 128, 17, 192, 7, 1, 3)
# verify_conv2d_nchw(1, 128, 17, 128, 7, 1, 3)
# verify_conv2d_nchw(1, 128, 17, 192, 1, 1, 0)
# verify_conv2d_nchw(1, 768, 17, 160, 1, 1, 0)
# verify_conv2d_nchw(1, 160, 17, 160, 1, 1, 0)
# verify_conv2d_nchw(1, 160, 17, 192, 7, 1, 3)
# verify_conv2d_nchw(1, 160, 17, 160, 7, 1, 3)
# verify_conv2d_nchw(1, 160, 17, 192, 1, 1, 0)
# verify_conv2d_nchw(1, 192, 17, 192, 1, 1, 0)
# verify_conv2d_nchw(1, 192, 17, 192, 7, 1, 3)
# verify_conv2d_nchw(1, 192, 17, 320, 3, 2, 0)
# verify_conv2d_nchw(1, 192, 17, 192, 3, 2, 0)
verify_conv2d_nchw(1, 96, 35, 96, 3, 2, 0)
verify_conv2d_nchw(1, 768, 17, 192, 1, 1, 0)
verify_conv2d_nchw(1, 768, 17, 128, 1, 1, 0)
verify_conv2d_nchw(1, 128, 17, 128, 1, 1, 0)
verify_conv2d_nchw(1, 128, 17, 192, 7, 1, 3)
verify_conv2d_nchw(1, 128, 17, 128, 7, 1, 3)
verify_conv2d_nchw(1, 128, 17, 192, 1, 1, 0)
verify_conv2d_nchw(1, 768, 17, 160, 1, 1, 0)
verify_conv2d_nchw(1, 160, 17, 160, 1, 1, 0)
verify_conv2d_nchw(1, 160, 17, 192, 7, 1, 3)
verify_conv2d_nchw(1, 160, 17, 160, 7, 1, 3)
verify_conv2d_nchw(1, 160, 17, 192, 1, 1, 0)
verify_conv2d_nchw(1, 192, 17, 192, 1, 1, 0)
verify_conv2d_nchw(1, 192, 17, 192, 7, 1, 3)
verify_conv2d_nchw(1, 192, 17, 320, 3, 2, 0)
verify_conv2d_nchw(1, 192, 17, 192, 3, 2, 0)
verify_conv2d_nchw(1, 1280, 8, 320, 1, 1, 0)
verify_conv2d_nchw(1, 1280, 8, 384, 1, 1, 0)
verify_conv2d_nchw(1, 384, 8, 384, 1, 1, 0)
......
"""Example code to do convolution."""
import numpy as np
import tvm
from tvm import autotvm
from tvm.autotvm.task.space import FallbackConfigEntity
import topi
import topi.testing
from tvm.contrib.pickle_memoize import memoize
from topi.util import get_const_tuple
def verify_conv2d_nchw(batch, in_channel, in_size, num_filter, kernel, stride, padding, dilation=1, add_bias=False, add_relu=False):
print("Workload: (%d, %d, %d, %d, %d, %d, %d)" % (batch, in_channel, in_size, num_filter, kernel, stride, padding))
in_height = in_width = in_size
A = tvm.placeholder((batch, in_channel, in_height, in_width), name='A')
W = tvm.placeholder((num_filter, in_channel, kernel, kernel), name='W')
bias = tvm.placeholder((num_filter, 1, 1), name='bias')
a_shape = get_const_tuple(A.shape)
w_shape = get_const_tuple(W.shape)
bias_shape = get_const_tuple(bias.shape)
dtype = A.dtype
@memoize("topi.tests.test_topi_conv2d_nchw.verify_conv2d_nchw")
def get_ref_data():
a_np = np.random.uniform(size=a_shape).astype(dtype)
w_np = np.random.uniform(size=w_shape).astype(dtype)
b_np = np.random.uniform(size=bias_shape).astype(dtype)
dw_np = topi.testing.dilate_python(w_np, (1, 1, dilation, dilation))
c_np = topi.testing.conv2d_nchw_python(a_np, dw_np, stride, padding)
if add_bias:
b_np = np.random.uniform(size=bias_shape).astype(dtype)
c_np += b_np
if add_relu:
c_np = np.maximum(c_np, 0)
return a_np, w_np, b_np, c_np
a_np, w_np, b_np, c_np = get_ref_data()
def check_device(device):
ctx = tvm.context(device, 0)
if not ctx.exist:
print("Skip because %s is not enabled" % device)
return
print("Running on target: %s" % device)
with tvm.target.create(device):
dW = topi.nn.dilate(W, (1, 1, dilation, dilation))
C = topi.nn.conv2d(A, dW, stride, padding, layout='NCHW', out_dtype=dtype)
if add_bias:
C = topi.add(C, bias)
if add_relu:
C = topi.nn.relu(C)
s = topi.generic.schedule_conv2d_nchw([C])
a = tvm.nd.array(a_np, ctx)
w = tvm.nd.array(w_np, ctx)
b = tvm.nd.array(b_np, ctx)
c = tvm.nd.array(np.zeros(get_const_tuple(C.shape), dtype=C.dtype), ctx)
if add_bias:
func = tvm.build(s, [A, W, bias, C], device, name="relu_%d_%d_%d_%d_%d_%d_%d_%d" % (batch, in_channel, in_size, num_filter, kernel, stride, padding, dilation))
func(a, w, b, c)
else:
func = tvm.build(s, [A, W, C], device, name="relu_%d_%d_%d_%d_%d_%d_%d_%d" % (batch, in_channel, in_size, num_filter, kernel, stride, padding, dilation))
func(a, w, c)
np.testing.assert_allclose(c.asnumpy(), c_np, rtol=1e-5)
for device in ['cuda', 'llvm -device=arm_cpu', 'opencl -device=mali']:
check_device(device)
class WinogradFallback(autotvm.FallbackContext):
def _query_inside(self, target, workload):
key = (target, workload)
if key in self.memory:
return self.memory[key]
cfg = FallbackConfigEntity()
cfg.template_key = 'winograd'
self.memory[key] = cfg
return cfg
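# Note: inside this context, any workload without a tuned record falls back to a
# FallbackConfigEntity whose template_key is pinned to 'winograd', so the tests below
# exercise the winograd template even when no TopHub records are available.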
def test_conv2d_nchw():
autotvm.DispatchContext.current.silent = True
with WinogradFallback():
# resnet 18 workloads
verify_conv2d_nchw(1, 64, 56, 64, 3, 1, 1)
verify_conv2d_nchw(1, 128, 28, 128, 3, 1, 1)
verify_conv2d_nchw(1, 256, 14, 256, 3, 1, 1)
verify_conv2d_nchw(1, 512, 7, 512, 3, 1, 1)
# batch size = 2
verify_conv2d_nchw(2, 64, 56, 64, 3, 1, 1)
# relu, bias
verify_conv2d_nchw(2, 64, 56, 64, 3, 1, 1, add_bias=True)
verify_conv2d_nchw(2, 64, 56, 64, 3, 1, 1, add_relu=True)
verify_conv2d_nchw(2, 64, 56, 64, 3, 1, 1, add_relu=True, add_bias=True)
# weird workloads
verify_conv2d_nchw(1, 1, 1, 1, 3, 1, 1)
verify_conv2d_nchw(3, 3, 3, 3, 3, 1, 1)
verify_conv2d_nchw(2, 13, 71, 59, 3, 1, 1)
if __name__ == "__main__":
test_conv2d_nchw()
import tvm
from tvm import autotvm
import topi
import topi.testing
import numpy as np
from topi.util import get_const_tuple
from topi.nn.util import get_pad_tuple
from tvm.contrib.pickle_memoize import memoize
from common import get_all_backend
......@@ -11,6 +13,16 @@ def depthwise_conv2d_with_workload_nchw(batch, in_channel, in_height, channel_mu
in_width = in_height
filter_channel = in_channel
filter_width = filter_height
stride_h = stride_w = stride
if dilation == 1:
# here we transform the padding argument from 'str' to 'tuple',
# because we need this to match the "workload" tuple to the records in TopHub
pad_h, pad_w, _, _ = get_pad_tuple(padding, (filter_height, filter_width))
padding_args = (pad_h, pad_w)
else:
padding_args = padding
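# e.g. for a 3x3 filter with padding="SAME" and stride 1, get_pad_tuple is expected to
# return (1, 1, 1, 1) (pad_top, pad_left, pad_down, pad_right), so padding_args becomes
# (1, 1) here; the return order is assumed from topi.nn.util.get_pad_tuple.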
# placeholder
Input = tvm.placeholder((batch, in_channel, in_height, in_width), name='Input')
Filter = tvm.placeholder((filter_channel, channel_multiplier, filter_height, filter_width), name='Filter')
......@@ -18,6 +30,8 @@ def depthwise_conv2d_with_workload_nchw(batch, in_channel, in_height, channel_mu
Scale = tvm.placeholder((in_channel * channel_multiplier,), name='Scale')
Shift = tvm.placeholder((in_channel * channel_multiplier,), name='Shift')
dtype = 'float32'
def check_device(device):
ctx = tvm.context(device, 0)
if not ctx.exist:
......@@ -26,7 +40,8 @@ def depthwise_conv2d_with_workload_nchw(batch, in_channel, in_height, channel_mu
print("Running on target: %s" % device)
with tvm.target.create(device):
# declare
DepthwiseConv2d = topi.nn.depthwise_conv2d_nchw(Input, DilatedFilter, stride=stride, padding=padding)
DepthwiseConv2d = topi.nn.depthwise_conv2d_nchw(Input, DilatedFilter,
(stride_h, stride_w), padding_args, dtype)
ScaleShift = topi.nn.scale_shift_nchw(DepthwiseConv2d, Scale, Shift)
Relu = topi.nn.relu(ScaleShift)
# schedule
......@@ -39,7 +54,6 @@ def depthwise_conv2d_with_workload_nchw(batch, in_channel, in_height, channel_mu
f3 = tvm.build(s3, [Input, Filter, Scale, Shift, Relu], device)
# Prepare pod type for test data closure
dtype = Input.dtype
input_shape = get_const_tuple(Input.shape)
filter_shape = get_const_tuple(Filter.shape)
scale_shape = get_const_tuple(Scale.shape)
......@@ -56,7 +70,7 @@ def depthwise_conv2d_with_workload_nchw(batch, in_channel, in_height, channel_mu
shift_np = np.random.uniform(size=shift_shape).astype(dtype)
# correctness with scipy
depthwise_conv2d_scipy = topi.testing.depthwise_conv2d_python_nchw(
input_np, dilated_filter_np, stride=stride, padding=padding)
input_np, dilated_filter_np, stride, padding)
scale_shift_scipy = np.zeros(shape=scale_shift_shape)
for c in range(in_channel * channel_multiplier):
scale_shift_scipy[:,c,:,:] = depthwise_conv2d_scipy[:,c,:,:] * scale_np[c] + shift_np[c]
......@@ -96,6 +110,15 @@ def depthwise_conv2d_with_workload_nhwc(batch, in_channel, in_height, channel_mu
filter_channel = in_channel
filter_width = filter_height
stride_w = stride_h
if dilation == 1:
# here we transform the padding argument from 'str' to 'tuple',
# because we need this to match the "workload" tuple to the records in TopHub
pad_h, pad_w, _, _ = get_pad_tuple(padding, (filter_height, filter_width))
padding_args = (pad_h, pad_w)
else:
padding_args = padding
# placeholder
Input = tvm.placeholder((batch, in_height, in_width, in_channel), name='Input')
Filter = tvm.placeholder((filter_height, filter_width,filter_channel, channel_multiplier), name='Filter')
......@@ -103,6 +126,8 @@ def depthwise_conv2d_with_workload_nhwc(batch, in_channel, in_height, channel_mu
Scale = tvm.placeholder((in_channel * channel_multiplier,), name='Scale')
Shift = tvm.placeholder((in_channel * channel_multiplier,), name='Shift')
dtype = 'float32'
def check_device(device):
ctx = tvm.context(device, 0)
if not ctx.exist:
......@@ -112,7 +137,8 @@ def depthwise_conv2d_with_workload_nhwc(batch, in_channel, in_height, channel_mu
with tvm.target.create(device):
# declare
DepthwiseConv2d = topi.nn.depthwise_conv2d_nhwc(Input, DilatedFilter, stride=[stride_h, stride_w], padding=padding)
DepthwiseConv2d = topi.nn.depthwise_conv2d_nhwc(Input, DilatedFilter,
(stride_h, stride_w), padding_args, dtype)
ScaleShift = topi.nn.scale_shift_nhwc(DepthwiseConv2d, Scale, Shift)
Relu = topi.nn.relu(ScaleShift)
# schedule
......@@ -125,7 +151,6 @@ def depthwise_conv2d_with_workload_nhwc(batch, in_channel, in_height, channel_mu
f3 = tvm.build(s3, [Input, Filter, Scale, Shift, Relu], device)
# Prepare pod type for test data closure
dtype = Input.dtype
input_shape = get_const_tuple(Input.shape)
filter_shape = get_const_tuple(Filter.shape)
scale_shape = get_const_tuple(Scale.shape)
......@@ -180,26 +205,36 @@ def depthwise_conv2d_with_workload_nhwc(batch, in_channel, in_height, channel_mu
def test_depthwise_conv2d():
print("testing nchw")
depthwise_conv2d_with_workload_nchw(1, 728, 64, 1, 3, 1, "SAME")
# load tophub
ctx = autotvm.apply_history_best([])
for device in get_all_backend():
context = autotvm.tophub.context(device)
context.__enter__()
# mobilenet workloads
depthwise_conv2d_with_workload_nchw(1, 32, 112, 1, 3, 1, "SAME")
depthwise_conv2d_with_workload_nchw(1, 64, 112, 1, 3, 2, "SAME")
depthwise_conv2d_with_workload_nchw(1, 128, 56, 1, 3, 1, "SAME")
depthwise_conv2d_with_workload_nchw(1, 128, 56, 1, 3, 2, "SAME")
depthwise_conv2d_with_workload_nchw(1, 256, 28, 1, 3, 1, "SAME")
depthwise_conv2d_with_workload_nchw(1, 256, 28, 1, 3, 2, "SAME")
depthwise_conv2d_with_workload_nchw(1, 512, 14, 1, 3, 1, "SAME")
depthwise_conv2d_with_workload_nchw(1, 512, 14, 1, 3, 2, "SAME")
depthwise_conv2d_with_workload_nchw(1, 1024, 7, 1, 3, 1, "SAME")
# NCHW
depthwise_conv2d_with_workload_nchw(1, 728, 32, 1, 3, 1, "SAME")
depthwise_conv2d_with_workload_nchw(4, 256, 64, 2, 5, 2, "SAME")
depthwise_conv2d_with_workload_nchw(4, 256, 32, 2, 5, 2, "SAME")
depthwise_conv2d_with_workload_nchw(1, 728, 64, 1, 3, 1, "VALID")
depthwise_conv2d_with_workload_nchw(1, 728, 32, 1, 3, 1, "VALID")
depthwise_conv2d_with_workload_nchw(4, 256, 64, 2, 5, 2, "VALID")
depthwise_conv2d_with_workload_nchw(4, 256, 32, 2, 5, 2, "VALID")
# dilation = 2
depthwise_conv2d_with_workload_nchw(1, 728, 64, 1, 3, 1, "SAME", dilation=2)
print("testing nhwc")
depthwise_conv2d_with_workload_nhwc(1, 728, 64, 1, 3, 1, "SAME")
# NHWC
depthwise_conv2d_with_workload_nhwc(1, 728, 32, 1, 3, 1, "SAME")
depthwise_conv2d_with_workload_nhwc(4, 256, 64, 2, 5, 2, "SAME")
depthwise_conv2d_with_workload_nhwc(4, 256, 32, 2, 5, 2, "SAME")
depthwise_conv2d_with_workload_nhwc(1, 728, 64, 1, 3, 1, "VALID")
depthwise_conv2d_with_workload_nhwc(1, 728, 32, 1, 3, 1, "VALID")
depthwise_conv2d_with_workload_nhwc(4, 256, 64, 2, 5, 2, "VALID")
depthwise_conv2d_with_workload_nhwc(4, 256, 32, 2, 5, 2, "VALID")
# dilation = 2
depthwise_conv2d_with_workload_nhwc(1, 728, 64, 1, 3, 1, "SAME", dilation=2)
......
......@@ -10,7 +10,7 @@ vendor provided library CuDNN in many cases.
######################################################################
# Install dependencies
# ----------------------------------------
# --------------------
# To use autotvm package in tvm, we need to install some extra dependencies.
# (change "3" to "2" if you use python2):
#
......@@ -20,7 +20,6 @@ vendor provided library CuDNN in many cases.
#
# To make tvm run faster in tuning, it is recommended to use cython
# as FFI of tvm. In the root directory of tvm, execute
# (change "3" to "2" if you use python2):
#
# .. code-block:: bash
#
......@@ -41,7 +40,7 @@ from tvm import autotvm
######################################################################
# Step 1: Define the search space
# ---------------------------------
# --------------------------------
# There are plenty of useful schedule primitives in tvm. You can also find
# some tutorials that describe them in more details, such as
# (1). :ref:`opt-conv-gpu`
......@@ -72,6 +71,21 @@ def conv2d_no_batching(N, H, W, CO, CI, KH, KW, stride, padding):
conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, 'float32')
s = tvm.create_schedule([conv.op])
##### space definition begin #####
n, f, y, x = s[conv].op.axis
rc, ry, rx = s[conv].op.reduce_axis
cfg = autotvm.get_config()
cfg.define_split("tile_f", f, num_outputs=4)
cfg.define_split("tile_y", y, num_outputs=4)
cfg.define_split("tile_x", x, num_outputs=4)
cfg.define_split("tile_rc", rc, num_outputs=3)
cfg.define_split("tile_ry", ry, num_outputs=3)
cfg.define_split("tile_rx", rx, num_outputs=3)
cfg.define_knob("auto_unroll_max_step", [0, 512, 1500])
cfg.define_knob("unroll_explicit", [0, 1])
##### space definition end #####
# inline padding
pad_data = s[conv].op.input_tensors[0]
s[pad_data].compute_inline()
......@@ -88,10 +102,6 @@ def conv2d_no_batching(N, H, W, CO, CI, KH, KW, stride, padding):
# tile and bind spatial axes
n, f, y, x = s[output].op.axis
cfg = autotvm.get_config()
cfg.define_split("tile_f", cfg.axis(f), num_outputs=4)
cfg.define_split("tile_y", cfg.axis(y), num_outputs=4)
cfg.define_split("tile_x", cfg.axis(x), num_outputs=4)
bf, vf, tf, fi = cfg["tile_f"].apply(s, output, f)
by, vy, ty, yi = cfg["tile_y"].apply(s, output, y)
bx, vx, tx, xi = cfg["tile_x"].apply(s, output, x)
......@@ -109,12 +119,9 @@ def conv2d_no_batching(N, H, W, CO, CI, KH, KW, stride, padding):
s[output].reorder(n, bf, by, bx, vf, vy, vx, tf, ty, tx, fi, yi, xi)
s[OL].compute_at(s[output], tx)
# tile and bind reduction axes
# tile reduction axes
n, f, y, x = s[OL].op.axis
rc, ry, rx = s[OL].op.reduce_axis
cfg.define_split("tile_rc", cfg.axis(rc), num_outputs=3)
cfg.define_split("tile_ry", cfg.axis(ry), num_outputs=3)
cfg.define_split("tile_rx", cfg.axis(rx), num_outputs=3)
rco, rcm, rci = cfg['tile_rc'].apply(s, OL, rc)
ryo, rym, ryi = cfg['tile_rx'].apply(s, OL, ry)
rxo, rxm, rxi = cfg['tile_ry'].apply(s, OL, rx)
......@@ -137,8 +144,6 @@ def conv2d_no_batching(N, H, W, CO, CI, KH, KW, stride, padding):
s[load].bind(tx, tvm.thread_axis("threadIdx.x"))
# tune unroll
cfg.define_knob("auto_unroll_max_step", [0, 512, 1500])
cfg.define_knob("unroll_explicit", [0, 1])
s[output].pragma(kernel_scope, 'auto_unroll_max_step', cfg['auto_unroll_max_step'].val)
s[output].pragma(kernel_scope, 'unroll_explicit', cfg['unroll_explicit'].val)
......
......@@ -8,9 +8,9 @@ performance. This is a tutorial about how to tune a whole convolutional
network.
The operator implementation for ARM CPU in TVM is written in template form.
It has many tunable knobs (tile factor, vectorization, unrolling, etc).
We will do tuning for all convolution and depthwise convolution operators
in the neural network. After the tuning, we can get a log file which stores
The template has many tunable knobs (tile factor, vectorization, unrolling, etc).
We will tune all convolution and depthwise convolution operators
in the neural network. After tuning, we produce a log file which stores
the best knob values for all required operators. When the tvm compiler compiles
these operators, it will query this log file to get the best knob values.
......@@ -21,15 +21,15 @@ to see the results.
######################################################################
# Install dependencies
# ----------------------------------------
# To use autotvm package in tvm, we need to install some extra dependencies.
# --------------------
# To use the autotvm package in tvm, we need to install some extra dependencies.
# (change "3" to "2" if you use python2):
#
# .. code-block:: bash
#
# pip3 install --user psutil xgboost tornado
#
# To make tvm run faster in tuning, it is recommended to use cython
# To make tvm run faster during tuning, it is recommended to use cython
# as FFI of tvm. In the root directory of tvm, execute
# (change "3" to "2" if you use python2):
#
......@@ -108,10 +108,9 @@ def get_network(name, batch_size):
# To scale up the tuning, TVM uses RPC Tracker to manage distributed devices.
# The RPC Tracker is a centralized master node. We can register all devices to
# the tracker. For example, if we have 10 phones, we can register all of them
# to the tracker, then we can run 10 measurements in parallel, which accelerates
# the tuning process.
# to the tracker, and run 10 measurements in parallel, accelerating the tuning process.
#
# To start an RPC tracker, run this command in the host machine. The tracker is
# To start an RPC tracker, run this command on the host machine. The tracker is
# required during the whole tuning process, so we need to open a new terminal for
# this command:
#
......@@ -144,6 +143,8 @@ def get_network(name, batch_size):
# * For Android:
# Follow this `readme page <https://github.com/dmlc/tvm/tree/master/apps/android_rpc>`_ to
# install tvm rpc apk on the android device. Make sure you can pass the android rpc test.
# Then your device is already registered. During tuning, go to the developer options
# and enable "Keep screen awake during charging", and keep the phone charged so that the measurements are stable.
#
# After registering devices, we can confirm it by querying rpc_tracker
#
......@@ -170,7 +171,7 @@ def get_network(name, batch_size):
###########################################
# Set Tuning Options
# ------------------
# Before tuning, we should do some configurations. Here I use an RK3399 board
# Before tuning, we should apply some configurations. Here I use an RK3399 board
# as an example. In your setting, you should modify the target and device_key accordingly.
# Set :code:`use_android` to True if you use an Android phone.
......@@ -213,18 +214,20 @@ tuning_option = {
#
# .. note:: How to set tuning options
#
# In general, the default value provided here works well.
# If you have large time budget, you can set :code:`n_trial`, :code:`early_stopping` larger,
# In general, the default values provided here work well.
# If you have a large enough time budget, you can set :code:`n_trial` and :code:`early_stopping` larger,
# which makes the tuning run longer.
# If your device runs very slowly or your conv2d operators have many GFLOPs, consider
# setting a larger timeout.
#
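# For example, to search longer you can enlarge these two values after
# :code:`tuning_option` is defined (the numbers below are only illustrative):
#
# .. code-block:: python
#
#     tuning_option['n_trial'] = 2000        # try more configurations per task
#     tuning_option['early_stopping'] = 800  # allow more trials without improvement before stopping
#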
###################################################################
# Begin Tuning
# ------------
# Now we can extract tuning tasks from the network and begin tuning.
# Here we provide a simple utility function to tune a list of tasks.
# Here, we provide a simple utility function to tune a list of tasks.
# This function is just an initial implementation which tunes them in sequential order.
# Later we will bring more sophisticated tuner scheduler.
# We will introduce a more sophisticated tuning scheduler in the future.
# You can skip the implementation of this function for this tutorial.
def tune_tasks(tasks,
......@@ -284,7 +287,7 @@ def tune_tasks(tasks,
########################################################################
# Finally we launch tuning jobs and evaluate the end-to-end performance.
# Finally, we launch tuning jobs and evaluate the end-to-end performance.
def tune_and_evaluate(tuning_opt):
# extract workloads from nnvm graph
......@@ -301,7 +304,7 @@ def tune_and_evaluate(tuning_opt):
# compile kernels with history best records
with autotvm.apply_history_best(log_file):
print("Compile...")
with nnvm.compiler.build_config(opt_level=2, add_pass=['AlterOpLayout']):
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(
net, target=target, shape={'data': input_shape}, params=params, dtype=dtype)
......@@ -338,7 +341,7 @@ def tune_and_evaluate(tuning_opt):
(np.mean(prof_res), np.std(prof_res)))
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run by yourself.
# Uncomment the following line to run it by yourself.
# tune_and_evaluate(tuning_option)
......@@ -373,9 +376,9 @@ def tune_and_evaluate(tuning_opt):
######################################################################
#
# .. note:: **Meet some problems?**
# .. note:: **Experiencing Difficulties?**
#
# The auto tuning module is error prone. If you always see " 0.00/ 0.00 GFLOPS",
# The auto tuning module is error-prone. If you always see " 0.00/ 0.00 GFLOPS",
# then there must be something wrong.
#
# First, make sure you set the correct configuration of your device.
......
"""
Auto-tuning a convolutional network for NVIDIA GPU
====================================================
**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_
Auto-tuning for specific devices and workloads is critical for getting the
best performance. This is a tutorial on how to tune a whole convolutional
network for NVIDIA GPU.
The operator implementation for NVIDIA GPU in TVM is written in template form.
The template has many tunable knobs (tile factor, unrolling, etc).
We will tune all convolution and depthwise convolution operators
in the neural network. After tuning, we produce a log file which stores
the best knob values for all required operators. When the tvm compiler compiles
these operators, it will query this log file to get the best knob values.
We also released pre-tuned parameters for some NVIDIA GPUs. You can go to
`NVIDIA GPU Benchmark <https://github.com/dmlc/tvm/wiki/Benchmark#nvidia-gpu>`_
to see the results.
"""
######################################################################
# Install dependencies
# --------------------
# To use the autotvm package in tvm, we need to install some extra dependencies.
# (change "3" to "2" if you use python2):
#
# .. code-block:: bash
#
# pip3 install --user psutil xgboost tornado
#
# To make tvm run faster during tuning, it is recommended to use cython
# as FFI of tvm. In the root directory of tvm, execute:
#
# .. code-block:: bash
#
# pip3 install --user cython
# sudo make cython3
#
# Now return to python code. Import packages.
import os
import numpy as np
import nnvm.testing
import nnvm.compiler
import tvm
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.contrib.util import tempdir
import tvm.contrib.graph_runtime as runtime
#################################################################
# Define Network
# --------------
# First we need to define the network in the nnvm symbol API.
# We can load some pre-defined networks from :code:`nnvm.testing`.
# We can also load models from MXNet, ONNX and TensorFlow (see NNVM
# tutorials :ref:`tutorial-nnvm` for more details).
def get_network(name, batch_size):
"""Get the symbol definition and random weight of a network"""
input_shape = (batch_size, 3, 224, 224)
output_shape = (batch_size, 1000)
if "resnet" in name:
n_layer = int(name.split('-')[1])
net, params = nnvm.testing.resnet.get_workload(num_layers=n_layer, batch_size=batch_size)
elif "vgg" in name:
n_layer = int(name.split('-')[1])
net, params = nnvm.testing.vgg.get_workload(num_layers=n_layer, batch_size=batch_size)
elif name == 'mobilenet':
net, params = nnvm.testing.mobilenet.get_workload(batch_size=batch_size)
elif name == 'squeezenet_v1.1':
net, params = nnvm.testing.squeezenet.get_workload(batch_size=batch_size, version='1.1')
elif name == 'inception_v3':
input_shape = (1, 3, 299, 299)
net, params = nnvm.testing.inception_v3.get_workload(batch_size=batch_size)
elif name == 'custom':
# an example for custom network
from nnvm.testing import utils
net = nnvm.sym.Variable('data')
net = nnvm.sym.conv2d(net, channels=4, kernel_size=(3,3), padding=(1,1))
net = nnvm.sym.flatten(net)
net = nnvm.sym.dense(net, units=1000)
net, params = utils.create_workload(net, batch_size, (3, 224, 224))
elif name == 'mxnet':
# an example for mxnet model
from mxnet.gluon.model_zoo.vision import get_model
block = get_model('resnet18_v1', pretrained=True)
net, params = nnvm.frontend.from_mxnet(block)
net = nnvm.sym.softmax(net)
else:
raise ValueError("Unsupported network: " + name)
return net, params, input_shape, output_shape
###########################################
# Set Tuning Options
# ------------------
# Before tuning, we apply some configurations.
#### DEVICE CONFIG ####
target = tvm.target.cuda()
#### TUNING OPTION ####
network = 'resnet-18'
log_file = "%s.log" % network
dtype = 'float32'
tuning_option = {
'log_filename': log_file,
'tuner': 'xgb',
'n_trial': 2000,
'early_stopping': 600,
'measure_option': autotvm.measure_option(
builder=autotvm.LocalBuilder(timeout=10),
runner=autotvm.LocalRunner(number=20, repeat=3, timeout=4),
),
}
####################################################################
#
# .. note:: How to set tuning options
#
# In general, the default values provided here work well.
#
# If you have a large time budget, you can set :code:`n_trial` and :code:`early_stopping` larger,
# which makes the tuning run longer.
#
# If you have multiple devices, you can use all of them for measurement to
# accelerate the tuning process (see the 'Scale up measurement' section below).
#
###################################################################
# Begin Tuning
# ------------
# Now we can extract tuning tasks from the network and begin tuning.
# Here, we provide a simple utility function to tune a list of tasks.
# This function is just an initial implementation which tunes them in sequential order.
# We will introduce a more sophisticated tuning scheduler in the future.
# You can skip the implementation of this function for this tutorial.
def tune_tasks(tasks,
measure_option,
tuner='xgb',
n_trial=1000,
early_stopping=None,
log_filename='tuning.log',
use_transfer_learning=True,
try_winograd=True):
if try_winograd:
for i in range(len(tasks)):
try: # try winograd template
tsk = autotvm.task.create(tasks[i].name, tasks[i].args,
tasks[i].target, tasks[i].target_host, 'winograd')
input_channel = tsk.workload[1][1]
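# use the winograd template only for layers with a large enough input channel count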
if input_channel >= 64:
tasks[i] = tsk
except Exception:
pass
# create tmp log file
tmp_log_file = log_filename + ".tmp"
if os.path.exists(tmp_log_file):
os.remove(tmp_log_file)
for i, tsk in enumerate(reversed(tasks)):
prefix = "[Task %2d/%2d] " %(i+1, len(tasks))
# create tuner
if tuner == 'xgb' or tuner == 'xgb-rank':
tuner_obj = XGBTuner(tsk, loss_type='rank')
elif tuner == 'ga':
tuner_obj = GATuner(tsk, pop_size=100)
elif tuner == 'random':
tuner_obj = RandomTuner(tsk)
elif tuner == 'gridsearch':
tuner_obj = GridSearchTuner(tsk)
else:
raise ValueError("Invalid tuner: " + tuner)
if use_transfer_learning:
if os.path.isfile(tmp_log_file):
tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))
# do tuning
tuner_obj.tune(n_trial=min(n_trial, len(tsk.config_space)),
early_stopping=early_stopping,
measure_option=measure_option,
callbacks=[
autotvm.callback.progress_bar(n_trial, prefix=prefix),
autotvm.callback.log_to_file(tmp_log_file)])
# pick best records to a cache file
autotvm.record.pick_best(tmp_log_file, log_filename)
os.remove(tmp_log_file)
########################################################################
# Finally, we launch tuning jobs and evaluate the end-to-end performance.
def tune_and_evaluate(tuning_opt):
# extract workloads from nnvm graph
print("Extract tasks...")
net, params, input_shape, out_shape = get_network(network, batch_size=1)
tasks = autotvm.task.extract_from_graph(net, target=target,
shape={'data': input_shape}, dtype=dtype,
symbols=(nnvm.sym.conv2d,))
# run tuning tasks
print("Tuning...")
tune_tasks(tasks, **tuning_opt)
# compile kernels with history best records
with autotvm.apply_history_best(log_file):
print("Compile...")
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(
net, target=target, shape={'data': input_shape}, params=params, dtype=dtype)
# export library
tmp = tempdir()
filename = "net.tar"
lib.export_library(tmp.relpath(filename))
# load parameters
ctx = tvm.context(str(target), 0)
params_tvm = {k: tvm.nd.array(v, ctx) for k, v in params.items()}
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module = runtime.create(graph, lib, ctx)
module.set_input('data', data_tvm)
module.set_input(**params_tvm)
# evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, number=400, repeat=3)
prof_res = np.array(ftimer().results) * 1000 # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
(np.mean(prof_res), np.std(prof_res)))
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.
# tune_and_evaluate(tuning_option)
######################################################################
# Sample Output
# -------------
# The tuning needs to compile many programs and extract features from them,
# so a high-performance CPU is recommended. One sample output is listed below.
# It takes about 4 hours to get the following output on a 32T AMD Ryzen Threadripper.
# The tuning target is NVIDIA 1080 Ti.
# (You may see some errors during compilation. As long as the tuning does not get stuck, it is okay.)
#
# .. code-block:: bash
#
# Extract tasks...
# Tuning...
# [Task 1/12] Current/Best: 541.83/3570.66 GFLOPS | Progress: (960/2000) | 1001.31 s Done.
# [Task 2/12] Current/Best: 0.56/ 803.33 GFLOPS | Progress: (704/2000) | 608.08 s Done.
# [Task 3/12] Current/Best: 103.69/1141.25 GFLOPS | Progress: (768/2000) | 702.13 s Done.
# [Task 4/12] Current/Best: 2905.03/3925.15 GFLOPS | Progress: (864/2000) | 745.94 sterminate called without an active exception
# [Task 4/12] Current/Best: 2789.36/3925.15 GFLOPS | Progress: (1056/2000) | 929.40 s Done.
# [Task 5/12] Current/Best: 89.06/1076.24 GFLOPS | Progress: (704/2000) | 601.73 s Done.
# [Task 6/12] Current/Best: 40.39/2129.02 GFLOPS | Progress: (1088/2000) | 1125.76 s Done.
# [Task 7/12] Current/Best: 4090.53/5007.02 GFLOPS | Progress: (800/2000) | 903.90 s Done.
# [Task 8/12] Current/Best: 4.78/1272.28 GFLOPS | Progress: (768/2000) | 749.14 s Done.
# [Task 9/12] Current/Best: 1391.45/2325.08 GFLOPS | Progress: (992/2000) | 1084.87 s Done.
# [Task 10/12] Current/Best: 1995.44/2383.59 GFLOPS | Progress: (864/2000) | 862.60 s Done.
# [Task 11/12] Current/Best: 4093.94/4899.80 GFLOPS | Progress: (224/2000) | 240.92 sterminate called without an active exception
# [Task 11/12] Current/Best: 3487.98/4909.91 GFLOPS | Progress: (480/2000) | 534.96 sterminate called without an active exception
# [Task 11/12] Current/Best: 4636.84/4912.17 GFLOPS | Progress: (1184/2000) | 1381.16 sterminate called without an active exception
# [Task 11/12] Current/Best: 50.12/4912.17 GFLOPS | Progress: (1344/2000) | 1602.81 s Done.
# [Task 12/12] Current/Best: 3581.31/4286.30 GFLOPS | Progress: (736/2000) | 943.52 s Done.
# Compile...
# Evaluate inference time cost...
# Mean inference time (std dev): 1.07 ms (0.05 ms)
#
# As a reference baseline, the time cost of MXNet + TensorRT on resnet-18 is 1.30ms. So we are a little faster.
######################################################################
#
# .. note:: **Experiencing Difficulties?**
#
# The auto tuning module is error-prone. If you always see " 0.00/ 0.00 GFLOPS",
# then there must be something wrong.
#
# First, make sure you set the correct configuration of your device.
# Then, you can print debug information by adding these lines at the beginning
# of the script. They will print every measurement result, where you can find useful
# error messages.
#
# .. code-block:: python
#
# import logging
# logging.getLogger('autotvm').setLevel(logging.DEBUG)
#
# Finally, always feel free to ask our community for help on https://discuss.tvm.ai
#################################################################
# Scale up measurement by using multiple devices
# ----------------------------------------------
#
# If you have multiple devices, you can use all of them for measurement.
# TVM uses the RPC Tracker to manage distributed devices.
# The RPC Tracker is a centralized master node. We can register all devices to
# the tracker. For example, if we have 10 GPU cards, we can register all of them
# to the tracker, and run 10 measurements in parallel, accelerating the tuning process.
#
# To start an RPC tracker, run this command on the host machine. The tracker is
# required during the whole tuning process, so we need to open a new terminal for
# this command:
#
# .. code-block:: bash
#
# python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
#
# The expected output is
#
# .. code-block:: bash
#
# INFO:RPCTracker:bind to 0.0.0.0:9190
#
# Then open another new terminal for the RPC server. We need to start one server
# for each dedicated device. We use a string key to distinguish the types of devices.
# You can pick a name you like.
# (Note: for the rocm backend, there are some internal errors with the compiler,
# so we need to add :code:`--no-fork` to the argument list.)
#
# .. code-block:: bash
#
# python -m tvm.exec.rpc_server --tracker=localhost:9190 --key=1080ti
#
# After registering devices, we can confirm it by querying rpc_tracker
#
# .. code-block:: bash
#
# python -m tvm.exec.query_rpc_tracker --host=localhost --port=9190
#
# For example, if we have four 1080ti, two titanx and one gfx900, the output can be
#
# .. code-block:: bash
#
# Queue Status
# ----------------------------------
# key total free pending
# ----------------------------------
# 1080ti 4 4 0
# titanx 2 2 0
# gfx900 1 1 0
# ----------------------------------
#
# Finally, we need to change the tuning option to use RPCRunner. Use the code below
# to replace the corresponding part above.
tuning_option = {
'log_filename': log_file,
'tuner': 'xgb',
'n_trial': 2000,
'early_stopping': 600,
'measure_option': autotvm.measure_option(
builder=autotvm.LocalBuilder(timeout=10),
runner=autotvm.RPCRunner(
'1080ti', # change the device key to your key
'localhost', 9190,
number=20, repeat=3, timeout=4),
),
}
"""
Auto-tuning a convolutional network for Mobile GPU
====================================================
**Author**: `Lianmin Zheng <https://github.com/merrymercy>`_
Auto-tuning for a specific device is critical for getting the best
performance. This is a tutorial about how to tune a whole convolutional
network.
The operator implementation for Mobile GPU in TVM is written in template form.
The template has many tunable knobs (tile factor, vectorization, unrolling, etc).
We will tune all convolution, depthwise convolution and dense operators
in the neural network. After tuning, we produce a log file which stores
the best knob values for all required operators. When the tvm compiler compiles
these operators, it will query this log file to get the best knob values.
We also released pre-tuned parameters for some arm devices. You can go to
`Mobile GPU Benchmark <https://github.com/dmlc/tvm/wiki/Benchmark#mobile-gpu>`_
to see the results.
"""
######################################################################
# Install dependencies
# --------------------
# To use the autotvm package in tvm, we need to install some extra dependencies.
# (change "3" to "2" if you use python2):
#
# .. code-block:: bash
#
# pip3 install --user psutil xgboost tornado
#
# To make tvm run faster during tuning, it is recommended to use cython
# as FFI of tvm. In the root directory of tvm, execute
# (change "3" to "2" if you use python2):
#
# .. code-block:: bash
#
# pip3 install --user cython
# sudo make cython3
#
# Now return to python code. Import packages.
import os
import numpy as np
import nnvm.testing
import nnvm.compiler
import tvm
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from tvm.contrib.util import tempdir
import tvm.contrib.graph_runtime as runtime
#################################################################
# Define network
# --------------
# First we need to define the network in the nnvm symbol API.
# We can load some pre-defined networks from :code:`nnvm.testing`.
# We can also load models from MXNet, ONNX and TensorFlow (see NNVM
# tutorials :ref:`tutorial-nnvm` for more details).
def get_network(name, batch_size):
"""Get the symbol definition and random weight of a network"""
input_shape = (batch_size, 3, 224, 224)
output_shape = (batch_size, 1000)
if "resnet" in name:
n_layer = int(name.split('-')[1])
net, params = nnvm.testing.resnet.get_workload(num_layers=n_layer, batch_size=batch_size)
elif "vgg" in name:
n_layer = int(name.split('-')[1])
net, params = nnvm.testing.vgg.get_workload(num_layers=n_layer, batch_size=batch_size)
elif name == 'mobilenet':
net, params = nnvm.testing.mobilenet.get_workload(batch_size=batch_size)
elif name == 'squeezenet_v1.1':
net, params = nnvm.testing.squeezenet.get_workload(batch_size=batch_size, version='1.1')
elif name == 'inception_v3':
input_shape = (1, 3, 299, 299)
net, params = nnvm.testing.inception_v3.get_workload(batch_size=batch_size)
elif name == 'custom':
# an example for custom network
from nnvm.testing import utils
net = nnvm.sym.Variable('data')
net = nnvm.sym.conv2d(net, channels=4, kernel_size=(3,3), padding=(1,1))
net = nnvm.sym.flatten(net)
net = nnvm.sym.dense(net, units=1000)
net, params = utils.create_workload(net, batch_size, (3, 224, 224))
elif name == 'mxnet':
# an example for mxnet model
from mxnet.gluon.model_zoo.vision import get_model
block = get_model('resnet18_v1', pretrained=True)
net, params = nnvm.frontend.from_mxnet(block)
net = nnvm.sym.softmax(net)
else:
raise ValueError("Unsupported network: " + name)
return net, params, input_shape, output_shape
#################################################################
# Start RPC Tracker
# -----------------
# TVM uses an RPC session to communicate with ARM boards.
# During tuning, the tuner will send the generated code to the board and
# measure the speed of code on the board.
#
# To scale up the tuning, TVM uses RPC Tracker to manage distributed devices.
# The RPC Tracker is a centralized master node. We can register all devices to
# the tracker. For example, if we have 10 phones, we can register all of them
# to the tracker, and run 10 measurements in parallel, accelerating the tuning process.
#
# To start an RPC tracker, run this command on the host machine. The tracker is
# required during the whole tuning process, so we need to open a new terminal for
# this command:
#
# .. code-block:: bash
#
# python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190
#
# The expected output is
#
# .. code-block:: bash
#
# INFO:RPCTracker:bind to 0.0.0.0:9190
#################################################################
# Register devices to RPC Tracker
# -----------------------------------
# Now we can register our devices to the tracker. The first step is to
# build tvm runtime for the ARM devices.
#
# * For Linux:
# Follow this section :ref:`build-tvm-runtime-on-device` to build
# tvm runtime on the device. Then register the device to tracker by
#
# .. code-block:: bash
#
# python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=rk3399
#
# (replace :code:`[HOST_IP]` with the IP address of your host machine)
#
# * For Android:
# Follow this `readme page <https://github.com/dmlc/tvm/tree/master/apps/android_rpc>`_ to
# install tvm rpc apk on the android device. Make sure you can pass the android rpc test.
# Then your device is already registered. During tuning, go to the developer options
# and enable "Keep screen awake during charging", and keep the phone charged so that the measurements are stable.
#
# After registering devices, we can confirm it by querying rpc_tracker
#
# .. code-block:: bash
#
# python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190
#
# For example, if we have 2 Huawei mate10 pro, 11 Raspberry Pi 3B and 2 rk3399,
# the output can be
#
# .. code-block:: bash
#
# Queue Status
# ----------------------------------
# key total free pending
# ----------------------------------
# mate10pro 2 2 0
# rk3399 2 2 0
# rpi3b 11 11 0
# ----------------------------------
#
# You can register multiple devices to the tracker to accelerate the measurement in tuning.
###########################################
# Set Tuning Options
# ------------------
# Before tuning, we should apply some configurations. Here I use an RK3399 board
# as an example. In your setting, you should modify the target and device_key accordingly.
# Set :code:`use_android` to True if you use an Android phone.
#### DEVICE CONFIG ####
target = tvm.target.create('opencl -device=mali')
# Replace "aarch64-linux-gnu" with the correct target of your board.
# This target host is used for cross compilation. You can query it by :code:`gcc -v` on your device.
target_host = 'llvm -target=aarch64-linux-gnu'
# Also replace this with the device key in your tracker
device_key = 'rk3399'
# Set this to True if you use android phone
use_android = False
#### TUNING OPTION ####
network = 'resnet-18'
log_file = "%s.%s.log" % (device_key, network)
dtype = 'float32'
tuning_option = {
'log_filename': log_file,
'tuner': 'xgb',
'n_trial': 1000,
'early_stopping': 450,
'measure_option': autotvm.measure_option(
builder=autotvm.LocalBuilder(
build_func='ndk' if use_android else 'default'),
runner=autotvm.RPCRunner(
device_key, host='localhost', port=9190,
number=10,
timeout=5,
),
),
}
####################################################################
#
# .. note:: How to set tuning options
#
# In general, the default values provided here work well.
# If you have a large enough time budget, you can set :code:`n_trial` and :code:`early_stopping` larger,
# which makes the tuning run longer.
# If your device runs very slowly or your conv2d operators have many GFLOPs, consider
# setting a larger timeout.
#
###################################################################
# Begin Tuning
# ------------
# Now we can extract tuning tasks from the network and begin tuning.
# Here, we provide a simple utility function to tune a list of tasks.
# This function is just an initial implementation which tunes them in sequential order.
# We will introduce a more sophisticated tuning scheduler in the future.
# You can skip the implementation of this function for this tutorial.
def tune_tasks(tasks,
measure_option,
tuner='xgb',
n_trial=1000,
early_stopping=None,
log_filename='tuning.log',
use_transfer_learning=True,
try_winograd=True):
if try_winograd:
for i in range(len(tasks)):
try: # try winograd template
tsk = autotvm.task.create(tasks[i].name, tasks[i].args,
tasks[i].target, tasks[i].target_host, 'winograd')
tasks.append(tsk)
except Exception:
pass
# create tmp log file
tmp_log_file = log_filename + ".tmp"
if os.path.exists(tmp_log_file):
os.remove(tmp_log_file)
for i, tsk in enumerate(reversed(tasks)):
prefix = "[Task %2d/%2d] " % (i+1, len(tasks))
# create tuner
if tuner == 'xgb' or tuner == 'xgb-rank':
tuner_obj = XGBTuner(tsk, loss_type='rank')
elif tuner == 'ga':
tuner_obj = GATuner(tsk, pop_size=50)
elif tuner == 'random':
tuner_obj = RandomTuner(tsk)
elif tuner == 'gridsearch':
tuner_obj = GridSearchTuner(tsk)
else:
raise ValueError("Invalid tuner: " + tuner)
if use_transfer_learning:
if os.path.isfile(tmp_log_file):
tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))
# do tuning
tuner_obj.tune(n_trial=min(n_trial, len(tsk.config_space)),
early_stopping=early_stopping,
measure_option=measure_option,
callbacks=[
autotvm.callback.progress_bar(n_trial, prefix=prefix),
autotvm.callback.log_to_file(tmp_log_file)])
# pick best records to a cache file
autotvm.record.pick_best(tmp_log_file, log_filename)
os.remove(tmp_log_file)
########################################################################
# Finally, we launch tuning jobs and evaluate the end-to-end performance.
def tune_and_evaluate(tuning_opt):
# extract workloads from nnvm graph
print("Extract tasks...")
net, params, input_shape, out_shape = get_network(network, batch_size=1)
tasks = autotvm.task.extract_from_graph(net, target=target, target_host=target_host,
shape={'data': input_shape}, dtype=dtype,
symbols=(nnvm.sym.conv2d, nnvm.sym.dense))
# run tuning tasks
print("Tuning...")
tune_tasks(tasks, **tuning_opt)
# compile kernels with history best records
with autotvm.apply_history_best(log_file):
print("Compile...")
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(
net, target=target, target_host=target_host,
shape={'data': input_shape}, params=params, dtype=dtype)
# export library
tmp = tempdir()
if use_android:
from tvm.contrib import ndk
filename = "net.so"
lib.export_library(tmp.relpath(filename), ndk.create_shared)
else:
filename = "net.tar"
lib.export_library(tmp.relpath(filename))
# upload module to device
print("Upload...")
remote = autotvm.measure.request_remote(device_key, 'localhost', 9190,
timeout=10000)
remote.upload(tmp.relpath(filename))
rlib = remote.load_module(filename)
# upload parameters to device
ctx = remote.context(str(target), 0)
rparams = {k: tvm.nd.array(v, ctx) for k, v in params.items()}
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module = runtime.create(graph, rlib, ctx)
module.set_input('data', data_tvm)
module.set_input(**rparams)
# evaluate
print("Evaluate inference time cost...")
ftimer = module.module.time_evaluator("run", ctx, number=50, repeat=3)
prof_res = np.array(ftimer().results) * 1000 # convert to millisecond
print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
(np.mean(prof_res), np.std(prof_res)))
# We do not run the tuning in our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.
# tune_and_evaluate(tuning_option)
######################################################################
# Sample Output
# -------------
# The tuning needs to compile many programs and extract features from them,
# so a high-performance CPU is recommended.
# One sample output is listed below. It takes about 3 hours on a 32T AMD Ryzen Threadripper.
#
# .. code-block:: bash
#
# Extract tasks...
# Tuning...
# [Task 1/17] Current/Best: 12.22/ 36.05 GFLOPS | Progress: (32/1000) | 42.12 s
#
# (The remaining tasks are still running; we will update the full output later.)
######################################################################
#
# .. note:: **Experiencing Difficulties?**
#
# The auto tuning module is error-prone. If you always see " 0.00/ 0.00 GFLOPS",
# then there must be something wrong.
#
# First, make sure you set the correct configuration of your device.
# Then, you can print debug information by adding these lines at the beginning
# of the script. They will print every measurement result, where you can find useful
# error messages.
#
# .. code-block:: python
#
# import logging
# logging.getLogger('autotvm').setLevel(logging.DEBUG)
#
# Finally, always feel free to ask our community for help on https://discuss.tvm.ai
......@@ -14,7 +14,7 @@ The whole workflow is illustrated by a matrix multiplication example.
######################################################################
# Install dependencies
# ----------------------------------------
# --------------------
# To use autotvm package in tvm, we need to install some extra dependencies.
# (change "3" to "2" if you use python2):
#
......@@ -44,7 +44,7 @@ from tvm import autotvm
######################################################################
# Step 1: Define the search space
# ---------------------------------
# --------------------------------
# In this section, we will rewrite deterministic tvm schedule code into a
# tunable schedule template. You can regard the process of search space definition
# as the parametrization of our existing schedule code.
......@@ -73,7 +73,7 @@ def matmul_v0(N, L, M, dtype):
#####################################################################
# Parametrize the schedule
# ^^^^^^^^^^^^^^^^^^^^^^^^^
# ^^^^^^^^^^^^^^^^^^^^^^^^
# In the previous schedule code, we use a constant "8" as the tiling factor.
# However, it might not be the best choice, because the best tiling factor depends
# on the real hardware environment and the input shape.
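#
# As a minimal sketch (assuming the :code:`tvm` and :code:`autotvm` imports from earlier in
# this tutorial), the constant tile factor can be exposed as tunable knobs as shown below.
# The function name and knob values here are illustrative, not the tutorial's exact code.

@autotvm.template
def matmul_tunable(N, L, M, dtype):
    # compute declaration: the same matmul as the deterministic version
    A = tvm.placeholder((N, L), name='A', dtype=dtype)
    B = tvm.placeholder((L, M), name='B', dtype=dtype)
    k = tvm.reduce_axis((0, L), name='k')
    C = tvm.compute((N, M), lambda i, j: tvm.sum(A[i, k] * B[k, j], axis=k), name='C')
    s = tvm.create_schedule(C.op)

    # replace the hard-coded "8" with knobs so the tuner can search over tile sizes
    cfg = autotvm.get_config()
    cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
    cfg.define_knob("tile_x", [1, 2, 4, 8, 16])

    # apply the chosen tile sizes in the schedule
    y, x = s[C].op.axis
    yo, yi = s[C].split(y, cfg['tile_y'].val)
    xo, xi = s[C].split(x, cfg['tile_x'].val)
    s[C].reorder(yo, xo, k, yi, xi)
    return s, [A, B, C]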
......
......@@ -165,7 +165,7 @@ else:
# optimization for mali
target = tvm.target.mali()
with nnvm.compiler.build_config(opt_level=2):
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(net, target=target,
shape={"data": data_shape}, params=params, target_host=target_host)
......
......@@ -156,7 +156,7 @@ else:
# The above line is a simple form of
# target = tvm.target.create('llvm -device=arm_cpu -model=bcm2837 -target=armv7l-linux-gnueabihf -mattr=+neon')
with nnvm.compiler.build_config(opt_level=2, add_pass=['AlterOpLayout']):
with nnvm.compiler.build_config(opt_level=3):
graph, lib, params = nnvm.compiler.build(
net, target, shape={"data": data_shape}, params=params)
......
......@@ -103,13 +103,22 @@ with tvm.target.create("cuda"):
######################################################################
# Fusing convolutions
# -------------------
# We can fuse :code:`topi.nn.conv2d` and :code:`topi.nn.relu` together
# We can fuse :code:`topi.nn.conv2d` and :code:`topi.nn.relu` together.
#
# .. note::
#
# TOPI functions are all generic functions. They have different implementations
# for different backends to optimize for performance.
# For each backend, they must be called under a target scope for both the compute
# declaration and the schedule. TVM will then choose the right implementation to call
# based on the target information.
data = tvm.placeholder((1, 3, 224, 224))
kernel = tvm.placeholder((10, 3, 5, 5))
conv = topi.nn.conv2d(data, kernel, strides=1, padding=2)
out = topi.nn.relu(conv)
with tvm.target.create("cuda"):
conv = topi.nn.conv2d(data, kernel, strides=1, padding=2)
out = topi.nn.relu(conv)
sconv = topi.generic.nn.schedule_conv2d_nchw(out)
print(tvm.lower(sconv, [data, kernel], simple_mode=True))
......