Commit dc996e45 by Yao Wang, committed by Yizhi Liu

AutoTVM x86 (#1772)

* AutoTVM for x86 conv2d

* Add ApplyGraphBest dispatch context

* Fix tutorial

* Fix conv2d

* Improve tutorial

* Fix default schedule

* Fix 1x1 default schedule loading

* Fix workload type

* Change gridsearch to random

* Add reference to autotvm arm

* Merge conv2d common and 1x1 decl

* Fix lint

* Minor fix
parent 86cb8ea2
python/tvm/autotvm/__init__.py
@@ -27,5 +27,6 @@ from .measure import measure_option, MeasureInput, MeasureResult, MeasureErrorNo
 from .tuner import callback
 from .task import template, get_config, create, ConfigSpace, ConfigEntity, \
     register_topi_compute, register_topi_schedule, \
-    DispatchContext, FallbackContext, ApplyHistoryBest as apply_history_best
+    DispatchContext, FallbackContext, ApplyHistoryBest as apply_history_best, \
+    ApplyGraphBest as apply_graph_best
 from .env import GLOBAL_SCOPE
python/tvm/autotvm/task/__init__.py
@@ -10,7 +10,7 @@ from .task import Task, create, register, template, get_config, args_to_workload
 from .space import ConfigSpace, ConfigEntity
 from .code_hash import attach_code_hash, attach_code_hash_to_arg
 from .dispatcher import dispatcher, DispatchContext, ApplyConfig, ApplyHistoryBest, \
-    FallbackContext, clear_fallback_cache
+    FallbackContext, clear_fallback_cache, ApplyGraphBest
 from .topi_integration import register_topi_compute, register_topi_schedule
 from .nnvm_integration import extract_from_graph, extract_from_multiple_graph
python/tvm/autotvm/task/dispatcher.py
@@ -345,3 +345,83 @@ def clear_fallback_cache(target, workload):
    while not isinstance(context, FallbackContext):
        context = context._old_ctx
    context.clear_cache(target, workload)


class ApplyGraphBest(DispatchContext):
    """Load the graph-level tuning optimal schedules.

    The input records should be in ascending order of node index
    for the target operator. Usually this can be obtained with the
    graph tuner.

    This context maintains an internal counter to indicate the current
    node index.
    """
    def __init__(self, records):
        """
        Parameters
        ----------
        records : str or iterator of (MeasureInput, MeasureResult)
            Collection of tuning records.
            If it is a str, it should be the filename of a records log file.
            Each row of this file is an encoded record pair.
            Otherwise, it is an iterator.
        """
        from ..record import load_from_file

        super(ApplyGraphBest, self).__init__()
        if isinstance(records, str):
            records = load_from_file(records)
        self._records = list(records)
        self._counter = 0
        self._global_cfg_dict = {}

    def _query_inside(self, target, workload):
        """
        Query the context to get the config from records.

        Parameters
        ----------
        target : Target
            The current target
        workload : Workload
            The current workload.

        Returns
        -------
        cfg : ConfigSpace
            The specific configuration.
        """
        # Records are consumed sequentially: the internal counter tracks the
        # node index of the current operator, so target and workload are unused.
        cfg = self._records[self._counter][0].config
        self._counter += 1
        return cfg

    def query_global_dict(self, key):
        """
        Query the context to get the config from the global
        config dictionary.

        Parameters
        ----------
        key : str
            Key to query the config.

        Returns
        -------
        cfg : ConfigSpace
            The specific configuration.
        """
        return self._global_cfg_dict[key]

    def update_global_dict(self, key, val):
        """
        Update the global config dictionary.

        Parameters
        ----------
        key : str
            Key of the config.
        val : ConfigSpace
            Value of the config.
        """
        self._global_cfg_dict[key] = val
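# A minimal usage sketch for the new context (the log name "graph_opt.log"
# is hypothetical; it assumes the records were produced by the graph tuner
# in ascending node-index order, as required above):
#
#     with autotvm.apply_graph_best("graph_opt.log"):
#         graph, lib, params = nnvm.compiler.build(
#             net, target, shape={'data': data_shape}, params=params)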
"""
Auto-tuning a convolutional network for x86 CPU
====================================================
**Author**: `Yao Wang <https://github.com/kevinthesun>`_

This is a tutorial on how to tune a convolutional neural network
for x86 CPU.
"""
import os
import numpy as np
import nnvm.testing
import nnvm.compiler
import tvm
from tvm import autotvm
from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner
from topi.x86.conv2d import conv_NCHWc_arg_to_workload
import tvm.contrib.graph_runtime as runtime
#################################################################
# Define network
# --------------
# First we need to define the network in nnvm symbol API.
# We can load some pre-defined networks from :code:`nnvm.testing`.
# We can also load models from MXNet, ONNX and TensorFlow (see NNVM
# tutorials :ref:`tutorial-nnvm` for more details).
#
# In this tutorial, we choose resnet-18 as the tuning example.
def get_network(name, batch_size):
    """Get the symbol definition and random weight of a network"""
    input_shape = (batch_size, 3, 224, 224)
    output_shape = (batch_size, 1000)

    if "resnet" in name:
        n_layer = int(name.split('-')[1])
        net, params = nnvm.testing.resnet.get_workload(num_layers=n_layer, batch_size=batch_size)
    elif "vgg" in name:
        n_layer = int(name.split('-')[1])
        net, params = nnvm.testing.vgg.get_workload(num_layers=n_layer, batch_size=batch_size)
    elif name == 'mobilenet':
        net, params = nnvm.testing.mobilenet.get_workload(batch_size=batch_size)
    elif name == 'squeezenet_v1.1':
        net, params = nnvm.testing.squeezenet.get_workload(batch_size=batch_size, version='1.1')
    elif name == 'inception_v3':
        input_shape = (1, 3, 299, 299)
        net, params = nnvm.testing.inception_v3.get_workload(batch_size=batch_size)
    elif name == 'custom':
        # an example for custom network
        from nnvm.testing import utils
        net = nnvm.sym.Variable('data')
        net = nnvm.sym.conv2d(net, channels=4, kernel_size=(3, 3), padding=(1, 1))
        net = nnvm.sym.flatten(net)
        net = nnvm.sym.dense(net, units=1000)
        net, params = utils.create_workload(net, batch_size, (3, 224, 224))
    elif name == 'mxnet':
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model
        block = get_model('resnet18_v1', pretrained=True)
        net, params = nnvm.frontend.from_mxnet(block)
        net = nnvm.sym.softmax(net)
    else:
        raise ValueError("Unsupported network: " + name)

    return net, params, input_shape, output_shape
# Replace "llvm" with the correct target of your cpu.
# For example, for AWS EC2 c5 instance with Intel Xeon
# Platinum 8000 series, the target should be "llvm -mcpu=skylake-avx512".
# For AWS EC2 c4 instance with Intel Xeon E5-2666 v3, it should be
# "llvm -mcpu=core-avx2".
target = "llvm"
batch_size = 1
dtype = "float32"
model_name = "resnet-18"
log_file = "%s.log" % model_name
# Set the number of threads used for tuning based on the number of
# physical CPU cores on your machine.
num_threads = 1
os.environ["TVM_NUM_THREADS"] = str(num_threads)
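# For example, to derive it from the logical core count reported by the
# OS (a rough proxy; with hyper-threading, the number of physical cores
# is often half of this):
#
# import multiprocessing
# num_threads = multiprocessing.cpu_count()
# os.environ["TVM_NUM_THREADS"] = str(num_threads)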
#################################################################
# Configure tensor tuning settings and create tasks
# -------------------------------------------------
# To get better kernel execution performance on x86 CPUs,
# we need to change the data layout of convolution from
# "NCHW" to "NCHWc". To deal with this situation, we define a
# conv2d_NCHWc operator in topi. We will tune this operator
# instead of plain conv2d.
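# For instance, with an inner channel split factor of 16, an "NCHW" data
# tensor of shape (1, 64, 56, 56) is packed into "NCHW16c" shape
# (1, 4, 56, 56, 16), so the innermost channel dimension is contiguous
# and can be vectorized with AVX instructions.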
#
# We will use local mode for tuning measurement. RPC tracker
# mode can be set up similarly to the approach in the
# :ref:`tune_nnvm_arm` tutorial; a minimal sketch follows the
# tuning option below.
tuning_option = {
'log_filename': log_file,
'tuner': 'random',
'early_stopping': None,
'measure_option': autotvm.measure_option(
builder=autotvm.LocalBuilder(),
runner=autotvm.LocalRunner(number=10, repeat=1,
min_repeat_ms=1000),
),
}
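# As noted above, measurement can also go through an RPC tracker instead
# of the local machine. A minimal sketch (the tracker address and device
# key are hypothetical; see :ref:`tune_nnvm_arm` for registering devices):
#
# measure_option = autotvm.measure_option(
#     builder=autotvm.LocalBuilder(),
#     runner=autotvm.RPCRunner(
#         'skylake',                   # hypothetical device key
#         host='0.0.0.0', port=9190,   # tracker address
#         number=10, repeat=1, min_repeat_ms=1000),
# )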
# You can skip the implementation of this function for this tutorial.
def tune_kernels(tasks,
                 measure_option,
                 tuner='gridsearch',
                 early_stopping=None,
                 log_filename='tuning.log'):
    for i, tsk in enumerate(tasks):
        prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))

        # converting conv2d tasks to conv2d_NCHWc tasks
        data, kernel, strides, padding, layout, dtype = tsk.args
        kernel_size = (kernel[1][2], kernel[1][3])
        data_plc = tvm.placeholder(data[1], name="data")
        kernel_plc = tvm.placeholder(kernel[1], name="kernel")
        args = [data_plc, kernel_plc, data[1][1], kernel_size, strides,
                padding, layout, layout, dtype]
        args = autotvm.task.nnvm_integration.serialize_args(args)
        task = autotvm.task.create("topi_x86_conv2d_NCHWc", args=args, target=target)
        task.workload = conv_NCHWc_arg_to_workload(data_plc, kernel_plc, kernel_size,
                                                   strides, padding, layout, layout, dtype)

        # create tuner
        if tuner == 'xgb' or tuner == 'xgb-rank':
            tuner_obj = XGBTuner(task, loss_type='rank')
        elif tuner == 'ga':
            tuner_obj = GATuner(task, pop_size=50)
        elif tuner == 'random':
            tuner_obj = RandomTuner(task)
        elif tuner == 'gridsearch':
            tuner_obj = GridSearchTuner(task)
        else:
            raise ValueError("Invalid tuner: " + tuner)

        # do tuning
        n_trial = len(task.config_space)
        tuner_obj.tune(n_trial=n_trial,
                       early_stopping=early_stopping,
                       measure_option=measure_option,
                       callbacks=[
                           autotvm.callback.progress_bar(n_trial, prefix=prefix),
                           autotvm.callback.log_to_file(log_filename)])
########################################################################
# Finally, we launch tuning jobs and evaluate the end-to-end performance.
def tune_and_evaluate(tuning_opt):
    # extract workloads from nnvm graph
    print("Extract tasks...")
    net, params, data_shape, out_shape = get_network(model_name, batch_size)
    tasks = autotvm.task.extract_from_graph(net, target=target,
                                            shape={'data': data_shape}, dtype=dtype,
                                            symbols=(nnvm.sym.conv2d,))

    # run tuning tasks
    print("Tuning...")
    tune_kernels(tasks, **tuning_opt)

    # compile kernels with history best records
    with autotvm.apply_history_best(log_file):
        print("Compile...")
        with nnvm.compiler.build_config(opt_level=3):
            graph, lib, params = nnvm.compiler.build(
                net, target=target, shape={'data': data_shape}, params=params, dtype=dtype)

        # upload parameters to device
        ctx = tvm.cpu()
        data_tvm = tvm.nd.array((np.random.uniform(size=data_shape)).astype(dtype))
        module = runtime.create(graph, lib, ctx)
        module.set_input('data', data_tvm)
        module.set_input(**params)

        # evaluate
        print("Evaluate inference time cost...")
        ftimer = module.module.time_evaluator("run", ctx, number=100, repeat=3)
        prof_res = np.array(ftimer().results) * 1000  # convert to millisecond
        print("Mean inference time (std dev): %.2f ms (%.2f ms)" %
              (np.mean(prof_res), np.std(prof_res)))
# We do not run the tuning on our webpage server since it takes too long.
# Uncomment the following line to run it by yourself.
# tune_and_evaluate(tuning_option)
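######################################################################
# The records in the log file can also be reused later without re-tuning.
# A minimal sketch (it assumes the tuning above has already produced
# ``log_file`` and that net, params and data_shape come from get_network;
# the output file name is hypothetical), using
# :code:`autotvm.record.pick_best` to keep only the best entry per workload:
#
# best_log = "%s_best.log" % model_name
# autotvm.record.pick_best(log_file, best_log)
# with autotvm.apply_history_best(best_log):
#     with nnvm.compiler.build_config(opt_level=3):
#         graph, lib, params = nnvm.compiler.build(
#             net, target=target, shape={'data': data_shape},
#             params=params, dtype=dtype)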
######################################################################
# Sample Output
# -------------
# The tuning needs to compile many programs and extract features from them,
# so a high-performance CPU is recommended.
# One sample output is listed below.
#
# .. code-block:: bash
#
# Extract tasks...
# Tuning...
# [Task 1/12] Current/Best: 598.05/2497.63 GFLOPS | Progress: (252/252) | 1357.95 s Done.
# [Task 2/12] Current/Best: 522.63/2279.24 GFLOPS | Progress: (784/784) | 3989.60 s Done.
# [Task 3/12] Current/Best: 447.33/1927.69 GFLOPS | Progress: (784/784) | 3869.14 s Done.
# [Task 4/12] Current/Best: 481.11/1912.34 GFLOPS | Progress: (672/672) | 3274.25 s Done.
# [Task 5/12] Current/Best: 414.09/1598.45 GFLOPS | Progress: (672/672) | 2720.78 s Done.
# [Task 6/12] Current/Best: 508.96/2273.20 GFLOPS | Progress: (768/768) | 3718.75 s Done.
# [Task 7/12] Current/Best: 469.14/1955.79 GFLOPS | Progress: (576/576) | 2665.67 s Done.
# [Task 8/12] Current/Best: 230.91/1658.97 GFLOPS | Progress: (576/576) | 2435.01 s Done.
# [Task 9/12] Current/Best: 487.75/2295.19 GFLOPS | Progress: (648/648) | 3009.95 s Done.
# [Task 10/12] Current/Best: 182.33/1734.45 GFLOPS | Progress: (360/360) | 1755.06 s Done.
# [Task 11/12] Current/Best: 372.18/1745.15 GFLOPS | Progress: (360/360) | 1684.50 s Done.
# [Task 12/12] Current/Best: 215.34/2271.11 GFLOPS | Progress: (400/400) | 2128.74 s Done.
# Compile...
# Evaluate inference time cost...
# Mean inference time (std dev): 3.16 ms (0.03 ms)