..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

.. _tvm-runtime-system:

TVM Runtime System
==================

TVM supports multiple programming languages for compiler stack development and deployment.
In this note, we explain the key elements of the TVM runtime.

.. image:: http://www.tvm.ai/images/release/tvm_flexible.png

We need to satisfy quite a few interesting requirements:

- Deployment: invoke the compiled function from python/javascript/c++.
- Debug: define a function in python and call it from a compiled function.
- Link: write driver code to call device-specific code (e.g., CUDA) and invoke it from the compiled host function.
- Prototype: define an IR pass in python and call it from the C++ backend.
- Expose: make the compiler stack developed in C++ accessible to the front end (e.g., python).
- Experiment: ship a compiled function to an embedded device and run it there directly.

We want to be able to define a function in any language and call it from another.
We also want the runtime core to be minimal so that it can be deployed to embedded devices.

PackedFunc
----------

`PackedFunc`_ is a simple but elegant solution
we found to the challenges listed above. The following code block provides an example in C++.

.. _PackedFunc: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/packed_func.h

.. code:: c

    #include <tvm/runtime/packed_func.h>

    void MyAdd(TVMArgs args, TVMRetValue* rv) {
      // automatically convert arguments to desired type.
      int a = args[0];
      int b = args[1];
      // automatically assign value return to rv
      *rv = a + b;
    }

    void CallPacked() {
      PackedFunc myadd = PackedFunc(MyAdd);
      // get back 3
      int c = myadd(1, 2);
    }

In the above code block, we define a PackedFunc ``MyAdd``. It takes two arguments:
``args`` represents the input arguments and ``rv`` represents the return value.
The function is type-erased, which means that the function signature does not restrict which input types can be passed in or which type can be returned.
Under the hood, when we call a PackedFunc, it packs the input arguments into TVMArgs on the stack,
and gets the result back via TVMRetValue.

Thanks to template tricks in C++, we can call a PackedFunc just like a normal function. Because of its type-erased nature, we can call a PackedFunc from dynamic languages like python, without additional glue code for each new type of function created.
The following example registers a PackedFunc in C++ and calls it from python.

.. code:: c

    // register a global packed function in c++
    TVM_REGISTER_GLOBAL("myadd")
    .set_body(MyAdd);

.. code:: python

    import tvm

    myadd = tvm.get_global_func("myadd")
    # prints 3
    print(myadd(1, 2))

Most of the magic of PackedFunc lies in the ``TVMArgs`` and ``TVMRetValue`` structures.
We restrict the list of possible types that can be passed.
Here are the common ones:

- int, float and string
- PackedFunc itself
- Module for compiled modules
- DLTensor* for tensor object exchange
- TVM Object to represent any object in IR

This restriction makes the implementation simple without the need for serialization.
Despite being minimal, PackedFunc is sufficient for the use case of deep learning deployment, as
most functions only take DLTensor or numbers.
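
To make the idea concrete, here is a minimal sketch of a type-erased calling convention with a restricted argument-type list, in plain Python. This is an illustration only, not TVM's actual implementation; the names in it are made up.

```python
# A minimal sketch of the PackedFunc idea (not TVM's real implementation):
# every function shares one signature, and arguments are restricted to a
# small set of types, so no per-signature glue code is needed.

ALLOWED_TYPES = (int, float, str)  # stand-ins for TVM's int/float/string


class PackedFunc:
    def __init__(self, body):
        # body takes a list of args and returns a single value
        self.body = body

    def __call__(self, *args):
        for a in args:
            # a PackedFunc itself is also an allowed argument type
            if not isinstance(a, ALLOWED_TYPES + (PackedFunc,)):
                raise TypeError("unsupported argument type: %r" % type(a))
        return self.body(list(args))


# the callee always sees the same uniform (args) -> value interface
myadd = PackedFunc(lambda args: args[0] + args[1])
print(myadd(1, 2))  # -> 3
```

Because the signature never changes, a single binding layer suffices to call any such function from any language.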

Since one PackedFunc can take another PackedFunc as an argument,
we can pass functions from python (as PackedFunc) to C++.

.. code:: c

    TVM_REGISTER_GLOBAL("callhello")
    .set_body([](TVMArgs args, TVMRetValue* rv) {
      PackedFunc f = args[0];
      f("hello world");
    });

.. code:: python

    import tvm

    def callback(msg):
      print(msg)

    # convert to PackedFunc
    f = tvm.convert(callback)
    callhello = tvm.get_global_func("callhello")
    # prints hello world
    callhello(f)

TVM provides a `minimum C API`_,
which allows us to embed PackedFunc into any language. Besides python, so far we support
`java`_ and `javascript`_.
This philosophy of an embedded API is much like Lua's, except that we do not invent a new language, but instead use C++.

.. _minimum C API: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/c_runtime_api.h
.. _java: https://github.com/apache/incubator-tvm/tree/master/jvm
.. _javascript: https://github.com/apache/incubator-tvm/tree/master/web


One fun fact about PackedFunc is that we use it for both the compiler and the deployment stack.

- All of TVM's compiler pass functions are exposed to the frontend as PackedFunc, see `here`_
- The compiled module also returns the compiled function as PackedFunc

.. _here: https://github.com/apache/incubator-tvm/tree/master/src/api

To keep the runtime minimal, we isolated the IR Object support from the deployment runtime. The resulting runtime takes around 200K - 600K depending on how many runtime driver modules (e.g., CUDA) are included.

The overhead of calling a PackedFunc versus a normal function is small, as it only saves a few values on the stack.
So it is fine as long as we do not wrap small functions.
In summary, PackedFunc is the universal glue in TVM; we use it extensively to support our compiler and deployment.

Module
------

Since TVM supports multiple types of devices, we need to support different types of drivers.
We have to use the driver API to load the kernel, set up the arguments in packed format, and perform the kernel launch.
We also need to patch up the driver API so that the exposed functions are thread-safe.
So we often need to implement these driver glues in C++ and expose them to the user.
We certainly cannot do this for every type of function, so again PackedFunc is our answer.

TVM defines the compiled object as a `Module`_.
The user can get the compiled function from the Module as a PackedFunc.
Generated compiled code can dynamically get a function from a Module at runtime. It caches the function handle on the first call and reuses it in subsequent calls. We use this to link device code and to call back into any PackedFunc (e.g., python) from generated code.

.. _Module: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/module.h

The ``ModuleNode`` is an abstract class that can be implemented by each type of device.
So far we support modules for CUDA, Metal, OpenCL, and loading dynamic shared libraries. This abstraction makes the introduction
of new devices easy, and we do not need to redo the host code generation for each type of device.
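
The abstraction can be sketched as an abstract base class with per-device subclasses. This is an illustrative Python sketch with made-up class names, not TVM's actual C++ hierarchy.

```python
# A sketch of the ModuleNode abstraction (names are illustrative): each
# device backend only implements how to look up its own kernels, while the
# host-side calling logic stays identical across backends.

from abc import ABC, abstractmethod


class ModuleNode(ABC):
    @abstractmethod
    def get_function(self, name):
        """Return the kernel `name` as a callable, or None if absent."""


class DSOModule(ModuleNode):           # e.g., a dynamic shared library
    def __init__(self):
        self._symbols = {"add": lambda a, b: a + b}

    def get_function(self, name):
        return self._symbols.get(name)


class CUDAModule(ModuleNode):          # would wrap the CUDA driver API
    def __init__(self):
        self._kernels = {"add": lambda a, b: a + b}   # pretend GPU kernel

    def get_function(self, name):
        return self._kernels.get(name)


# host code does not change per backend
for mod in (DSOModule(), CUDAModule()):
    f = mod.get_function("add")
    print(f(1, 2))   # -> 3 for both backends
```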

Remote Deployment
-----------------

The PackedFunc and Module system also makes it easy to ship functions to remote devices directly.
Under the hood, we have an RPCModule that serializes the arguments, performs the data movement, and launches the computation on the remote device.
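
The serialize-dispatch-reply round trip can be sketched as follows. This toy Python example stands in for the real RPC protocol; the function table and helpers are made up, and the network hop is replaced by a direct call.

```python
# A toy sketch of the RPC idea (illustrative, not TVM's real protocol):
# the client serializes the function name and arguments, the "remote" side
# deserializes them, runs the function, and sends the result back.

import pickle

REMOTE_FUNCS = {"add": lambda a, b: a + b}   # functions living "remotely"


def rpc_server(request_bytes):
    """Pretend this runs on the device: decode, dispatch, encode."""
    name, args = pickle.loads(request_bytes)
    result = REMOTE_FUNCS[name](*args)
    return pickle.dumps(result)


def rpc_call(name, *args):
    """Client side: serialize the call, 'send' it, decode the reply."""
    request = pickle.dumps((name, args))     # data movement happens here
    reply = rpc_server(request)              # stands in for a network hop
    return pickle.loads(reply)


print(rpc_call("add", 1, 2))   # -> 3
```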

.. image:: http://www.tvm.ai/images/release/tvm_rpc.png

The RPC server itself is minimal and can be bundled into the runtime. We can start a minimal TVM
RPC server on an iPhone, android, raspberry pi, or even the browser. The cross compilation on the server and the shipping of the module for testing can be done in the same script. Check out the
`Cross compilation and RPC tutorial`_ for more details.

.. _Cross compilation and RPC tutorial: https://docs.tvm.ai/tutorials/cross_compilation_and_rpc.html#sphx-glr-tutorials-cross-compilation-and-rpc-py

This instant feedback gives us a lot of advantages. For example, to test the correctness of generated code on an iPhone, we no longer have to write test cases in swift/objective-c from scratch -- we can use RPC to execute on the iPhone, copy the result back, and verify it on the host via numpy. We can also do profiling using the same script.

TVM Object and Compiler Stack
-----------------------------

As we mentioned earlier, we build the compiler stack API on top of the PackedFunc runtime system.
We faced constant changes of the compiler API due to the needs of research: we need a new language object or IR node whenever we want to test out new primitives.
However, we do not want to change our API from time to time. Besides that, we also want to

- be able to serialize any language object and IRs
- be able to explore, print, and manipulate the IR objects in front-end language to do quick prototyping.

We introduced a base class, called `Object`_, to solve this problem.
Every language object in the compiler stack is a subclass of ``Object``. Each object contains a string ``type_key`` that uniquely identifies
the type of the object. We choose a string instead of an int as the type key so that new ``Object`` classes can be added in a decentralized fashion without
adding the code back to the central repo. To speed up dispatching, we allocate an integer ``type_index`` at runtime for each ``type_key``.
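
The key-to-index allocation can be sketched as a small registry. This is an illustrative Python sketch, not TVM's actual registry code.

```python
# A sketch of decentralized type registration (illustrative): type keys are
# strings, so new ones can be registered from anywhere; an integer index is
# allocated at runtime for fast dispatch.

class TypeRegistry:
    def __init__(self):
        self._key_to_index = {}
        self._index_to_key = []

    def get_index(self, type_key):
        # allocate a fresh integer index the first time a key is seen
        if type_key not in self._key_to_index:
            self._key_to_index[type_key] = len(self._index_to_key)
            self._index_to_key.append(type_key)
        return self._key_to_index[type_key]


registry = TypeRegistry()
# modules can register types in any order, without a central list
i = registry.get_index("Tensor")
j = registry.get_index("my_project.CustomNode")
print(i, j)                              # -> 0 1
print(registry.get_index("Tensor"))      # -> 0 (stable across calls)
```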

.. _Object: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/object.h

Since one ``Object`` can usually be referenced in multiple places in the language, we use a shared_ptr to keep
track of references. We use the ``ObjectRef`` class to represent a reference to an ``Object``.
We can roughly view the ``ObjectRef`` class as a shared_ptr to the ``Object`` container.
We can also define subclasses of ``ObjectRef`` to hold each subtype of ``Object``. Each subclass of ``Object`` needs to define the ``VisitAttrs`` function.

.. code:: c

    class AttrVisitor {
    public:
      virtual void Visit(const char* key, double* value) = 0;
      virtual void Visit(const char* key, int64_t* value) = 0;
      virtual void Visit(const char* key, uint64_t* value) = 0;
      virtual void Visit(const char* key, int* value) = 0;
      virtual void Visit(const char* key, bool* value) = 0;
      virtual void Visit(const char* key, std::string* value) = 0;
      virtual void Visit(const char* key, void** value) = 0;
      virtual void Visit(const char* key, Type* value) = 0;
      virtual void Visit(const char* key, ObjectRef* value) = 0;
      // ...
    };

    class BaseAttrsNode : public Object {
    public:
      virtual void VisitAttrs(AttrVisitor* v) {}
      // ...
    };

Each ``Object`` subclass will override this to visit its members. Here is an example implementation of ``TensorNode``.

.. code:: c

    class TensorNode : public Object {
    public:
      /*! \brief The shape of the tensor */
      Array<Expr> shape;
      /*! \brief data type in the content of the tensor */
      Type dtype;
      /*! \brief the source operation, can be None */
      Operation op;
      /*! \brief the output index from source operation */
      int value_index{0};
      /*! \brief constructor */
      TensorNode() {}

      void VisitAttrs(AttrVisitor* v) final {
        v->Visit("shape", &shape);
        v->Visit("dtype", &dtype);
        v->Visit("op", &op);
        v->Visit("value_index", &value_index);
      }
    };

In the above example, both ``Operation`` and ``Array<Expr>`` are ObjectRefs.
``VisitAttrs`` gives us a reflection API to visit each member of the object.
We can use this function to visit the node and serialize any language object recursively.
It also allows us to get members of an object easily in a front-end language.
For example, in the following code, we access the ``op`` field of the ``TensorNode``.

.. code:: python

    import tvm
    from tvm import te

    x = te.placeholder((3, 4), name="x")
    # access the op field of TensorNode
    print(x.op.name)

New ``Object`` subclasses can be added to C++ without changing the front-end runtime, making it easy to extend the compiler stack.
Note that this is not the fastest way to expose members to a front-end language, but it might be one of the simplest
approaches possible. We also find that it fits our purposes, as we mainly use python for testing and prototyping and still use C++
to do the heavy lifting.
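
The reflection idea above can be sketched in Python: each node reports its members to a visitor, so one generic visitor works for every node type. The class names here are illustrative stand-ins, not TVM's real code.

```python
# A sketch of the VisitAttrs reflection mechanism (illustrative): a node
# calls back into a visitor for each member, so a single generic visitor
# can inspect or serialize any node type without knowing its layout.

class DictVisitor:
    """Collects every visited (key, value) pair into a dict."""

    def __init__(self):
        self.attrs = {}

    def visit(self, key, value):
        self.attrs[key] = value


class TensorNode:
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = dtype

    def visit_attrs(self, visitor):
        # mirror of the C++ VisitAttrs override
        visitor.visit("shape", self.shape)
        visitor.visit("dtype", self.dtype)


node = TensorNode([3, 4], "float32")
v = DictVisitor()
node.visit_attrs(v)          # generic code, works for any node type
print(v.attrs)               # -> {'shape': [3, 4], 'dtype': 'float32'}
```

A serializer would simply recurse into members that are themselves nodes.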

Implementation Details
----------------------

Each argument in PackedFunc contains a union value `TVMValue`_
and a type code. This design allows a dynamically typed language to convert to the corresponding type directly, and a statically typed language to
do runtime type checking during conversion.
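
The pairing of a union value with a type code can be sketched as follows. The type codes and helper functions here are made-up stand-ins, not TVM's actual ABI.

```python
# A sketch of the (value, type code) pairing (illustrative): each argument
# travels as a raw value plus an integer code naming its type, so a callee
# can check the code before converting.

K_INT, K_FLOAT, K_STR = 0, 1, 2     # stand-ins for TVM's type codes


def pack(arg):
    """Pack a python value into a (type_code, value) pair."""
    if isinstance(arg, bool):        # bool is a subclass of int; reject it
        raise TypeError("unsupported type")
    if isinstance(arg, int):
        return (K_INT, arg)
    if isinstance(arg, float):
        return (K_FLOAT, arg)
    if isinstance(arg, str):
        return (K_STR, arg)
    raise TypeError("unsupported type")


def unpack_int(packed):
    """A statically typed callee checks the code before converting."""
    code, value = packed
    if code != K_INT:
        raise TypeError("expected int, got code %d" % code)
    return value


print(unpack_int(pack(42)))   # -> 42
```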

.. _TVMValue: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/c_runtime_api.h#L122

The relevant files are

- `packed_func.h`_ for the C++ API
- `c_runtime_api.cc`_ for the C API and how to provide callbacks.

.. _packed_func.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/packed_func.h
.. _c_runtime_api.cc: https://github.com/apache/incubator-tvm/blob/master/src/runtime/c_runtime_api.cc#L262

To support extension types, we use a registry system to register type-related information, like support of any
in C++. See `Extension types`_ for more details.

.. _Extension types: https://github.com/apache/incubator-tvm/tree/master/apps/extension