..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

Putting the VM in TVM: The Relay Virtual Machine
================================================

Relay, a new program representation, has enabled the representation and optimization of
a great breadth of machine learning programs.
Unfortunately, by supporting a more expressive set of programs, we have
introduced several new execution challenges.

Relay's interpreter can execute the full language but has notable limitations
that make it unsuited for production deployments. It executes a program by
traversing its AST. This approach is conceptually simple but inefficient, as the
AST traversal relies heavily on indirection.

There are further challenges in compiling dynamic code, such as dynamic scheduling and allocation,
fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions
for these, but none is sufficiently compelling or optimized.

The second execution mechanism is the existing graph runtime. In order to target Relay
programs to this, we compile a small subset of them to the old graph format and execute
them on the runtime. The graph runtime provides fast execution, but only for a very limited
subset of Relay programs.

A non-standard alternative is Relay's ahead-of-time compiler,
which compiles a Relay program into a shared library containing an
ahead-of-time implementation. The ahead-of-time compiler provides compelling performance
but is difficult to extend and instrument: doing so requires modifying the
code generation and optimization mechanisms.

The Relay virtual machine is intended to be a framework that balances these competing
approaches, providing a dynamic execution environment which can be extended, instrumented,
and integrated with other approaches like ahead-of-time compilation via a flexible extension
mechanism.

The virtual machine is designed to strike a balance between performance and flexibility
when deploying and executing Relay programs, without giving up the benefits of TVM.

Virtual machine (VM) design is a well-studied area in programming languages and systems,
and there have been various virtual machine designs for both full-fledged
and embedded programming languages.
Previous language VM designs have been heavily tailored to the execution profile of traditional programs.
Traditional programs manipulate small scalar values and consist of a large number of low-level instructions.
The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.
In the context of machine learning we manipulate primarily tensor values, using a (relatively)
low number of high level instructions. ML programs' cost centers are expensive operator invocations,
such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs,
micro-optimizations present in scalar VMs are dramatically less important.

TVM has provided strong support for vision models,
but we want to grow to support a wider variety of models.
The graph runtime is able to utilize the fully static nature of the input graphs to perform
aggressive optimization such as fully static allocation, and optimal memory reuse.
When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic
allocation, we must change how execution works. A virtual machine for Relay is a natural choice.

The rest of this document provides a high-level overview of the Relay
virtual machine design and its instruction set.

Design
------

The VM's design is focused on simplicity without sacrificing performance.
In order to accomplish this we have focused on designing a tensor VM rather than a scalar VM.

In the tensor VM setting, we optimize for cheap “allocation” of objects (by trying to avoid real allocation),
reuse of static fragments, and the ability to handle dynamic shapes (i.e., jagged tensors).

Instruction Set
~~~~~~~~~~~~~~~

The choices of an instruction set and instruction representation are the most critical design decisions for a VM.
The current representation of the instructions is a tagged union containing the op-code and the data payload.  An important design decision is the level of abstraction of the instructions (RISC vs. CISC) and how they take their data (fixed-width instruction encoding vs. variable-length encoding). The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high-level and corresponds roughly to high-level operations in Relay.
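
As a rough illustration of a tagged instruction with a variable-length payload, the following Python sketch models an instruction as an opcode (the tag) plus a tuple of operands. The ``Opcode``, ``Instruction``, and ``encoded_length`` names are hypothetical and only mirror the instructions described below; they are not the VM's actual C++ definitions, and the ``dtype`` operand is omitted for brevity.

.. code-block:: python

    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import Tuple


    class Opcode(Enum):
        RET = auto()
        ALLOC_TENSOR = auto()


    @dataclass(frozen=True)
    class Instruction:
        op: Opcode                # the tag of the union
        payload: Tuple[int, ...]  # operand layout depends on the tag


    def ret(dst: int, result: int) -> Instruction:
        return Instruction(Opcode.RET, (dst, result))


    def alloc_tensor(dst: int, storage: int, shape: Tuple[int, ...]) -> Instruction:
        # The shape is embedded in the instruction, so its length varies with ndim.
        return Instruction(Opcode.ALLOC_TENSOR, (dst, storage, len(shape), *shape))


    def encoded_length(instr: Instruction) -> int:
        # One slot for the opcode plus one slot per operand (illustrative only).
        return 1 + len(instr.payload)

In this sketch a ``Ret`` always occupies three slots, while an ``AllocTensor`` grows with the rank of the tensor it allocates.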

Ret
^^^
**Arguments**:
::
  RegName dst
  RegName result

Returns the object in register ``result`` to the caller's register ``dst``.

InvokePacked
^^^^^^^^^^^^
**Arguments**:
::
  Index packed_index
  Index arity
  Index output_size
  RegName* packed_args

Invoke the packed function denoted by ``packed_index``. The ``arity``
and ``output_size`` are used to inform the VM how many inputs and
outputs to expect. ``packed_args`` stores the list of argument registers. Note that ``Index``
is an alias of ``int64_t``, and it is used in other instructions as well.
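
The sketch below shows one way ``arity`` and ``output_size`` could be used to split the argument registers. It assumes that ``arity`` counts every entry of ``packed_args`` and that the trailing ``output_size`` registers hold preallocated outputs which the callee fills in place; this convention, along with the ``invoke_packed`` and ``add_kernel`` helpers, is an assumption made for illustration rather than a description of the actual runtime.

.. code-block:: python

    def invoke_packed(registers, packed_funcs, packed_index, arity, output_size, packed_args):
        """Toy model of InvokePacked over a dict-based register file."""
        if len(packed_args) != arity:
            raise ValueError("packed_args must contain exactly `arity` registers")
        values = [registers[reg] for reg in packed_args]
        num_inputs = arity - output_size
        inputs, outputs = values[:num_inputs], values[num_inputs:]
        # The callee writes its results into the preallocated output buffers.
        packed_funcs[packed_index](inputs, outputs)


    def add_kernel(inputs, outputs):
        """Hypothetical element-wise add kernel using plain lists as tensors."""
        lhs, rhs = inputs
        out = outputs[0]
        for i in range(len(out)):
            out[i] = lhs[i] + rhs[i]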

AllocTensor
^^^^^^^^^^^
**Arguments**:
::
  RegName dst
  RegName storage
  uint32_t ndim
  int64_t* shape
  DLDataType dtype

Allocate a tensor value using the constant shape (stored in ``shape``) and ``dtype``
from the given storage block, ``storage``. The result is saved to register ``dst``.

AllocTensorReg
^^^^^^^^^^^^^^
**Arguments**:
::
  RegName dst
  RegName storage
  RegName shape_register
  DLDataType dtype

Allocate a tensor value of the appropriate shape (stored in ``shape_register``)
and ``dtype`` from the given storage block (stored in ``storage``). The result is saved to register ``dst``.

AllocStorage
^^^^^^^^^^^^
**Arguments**:
::
  RegName dst
  RegName size
  RegName alignment
  DLDataType dtype_hint

Allocate a storage block with the given ``size``, ``alignment``, and data type hint, ``dtype_hint``.
The allocated storage block is stored in register ``dst``.
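
To make the split between storage allocation and tensor allocation concrete, here is a minimal Python sketch in which a storage block is a raw byte buffer and a tensor is a shaped view that must fit inside it. The ``Storage`` and ``Tensor`` classes, and the use of an ``itemsize`` in bytes in place of a ``DLDataType``, are simplifying assumptions and do not reflect the runtime's real data structures.

.. code-block:: python

    from dataclasses import dataclass
    from math import prod
    from typing import Tuple


    @dataclass
    class Storage:
        buffer: bytearray
        alignment: int


    @dataclass
    class Tensor:
        storage: Storage
        shape: Tuple[int, ...]
        itemsize: int


    def alloc_storage(size: int, alignment: int) -> Storage:
        # Reserve raw bytes once; tensors are carved out of this block later.
        return Storage(bytearray(size), alignment)


    def alloc_tensor(storage: Storage, shape: Tuple[int, ...], itemsize: int) -> Tensor:
        needed = prod(shape) * itemsize
        if needed > len(storage.buffer):
            raise ValueError("tensor does not fit in the storage block")
        # No new memory is allocated here; the tensor refers to existing storage.
        return Tensor(storage, tuple(shape), itemsize)

Separating the two steps is what lets a compiler reuse one storage block for several tensors whose lifetimes do not overlap.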

AllocADT
^^^^^^^^
**Arguments**:
::
  RegName dst
  Index tag
  Index num_fields
  RegName* datatype_fields

Allocate a data type with the tag ``tag`` using the ``num_fields`` entries
from registers ``datatype_fields``. The result is saved to register ``dst``.

AllocClosure
^^^^^^^^^^^^
**Arguments**:
::
  RegName dst
  Index clo_index
  Index num_freevar
  RegName* free_vars

Allocate a closure with the VMFunction at ``clo_index`` as
its code, capturing the ``num_freevar`` entries from the registers in
``free_vars``. The result is saved to register ``dst``.

GetField
^^^^^^^^
**Arguments**:
::
  RegName dst
  RegName object
  Index field_index

Get the field value with index ``field_index`` from ``object`` and save the result to register ``dst``.

If
^^
**Arguments**:
::
  RegName test
  RegName target
  Index true_offset
  Index false_offset

Check if the object at register ``test`` is equal to ``target``.
If they are equal, jump by the relative offset ``true_offset``; otherwise,
jump by the relative offset ``false_offset``.

GetTag
^^^^^^
**Arguments**:
::
  RegName object
  RegName dst

Get the object tag for the ADT object in register ``object`` and save the result to register ``dst``.
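
The three ADT-related instructions (``AllocADT``, ``GetField``, and ``GetTag``) can be sketched together as operations on a small tagged record. The ``ADT`` class and the dict-based register file below are hypothetical stand-ins chosen for illustration; they are not the runtime's object representation.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Any, Tuple


    @dataclass(frozen=True)
    class ADT:
        tag: int
        fields: Tuple[Any, ...]


    def alloc_adt(registers, dst, tag, num_fields, datatype_fields):
        if len(datatype_fields) != num_fields:
            raise ValueError("datatype_fields must contain num_fields registers")
        # Copy the current register contents into the new object's fields.
        registers[dst] = ADT(tag, tuple(registers[reg] for reg in datatype_fields))


    def get_field(registers, dst, obj, field_index):
        registers[dst] = registers[obj].fields[field_index]


    def get_tag(registers, dst, obj):
        registers[dst] = registers[obj].tag

A tuple can be modeled the same way by using a single fixed tag, while constructors of an algebraic data type are distinguished by their tags.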

Fatal
^^^^^
Fail the virtual machine execution.

Goto
^^^^
**Arguments**:
::
  Index pc_offset

Relative unconditional jump by ``pc_offset``.
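
Both ``If`` and ``Goto`` express their targets as offsets relative to the current program counter. A minimal sketch of that update rule follows; the ``step_if`` and ``step_goto`` helpers are hypothetical, and the sketch assumes that ``target`` names a register whose contents are compared with those of ``test``.

.. code-block:: python

    def step_if(registers, pc, test, target, true_offset, false_offset):
        """Return the next program counter after an If instruction."""
        if registers[test] == registers[target]:
            return pc + true_offset
        return pc + false_offset


    def step_goto(pc, pc_offset):
        """Return the next program counter after a Goto instruction."""
        # A negative offset jumps backwards, which is how loops can be formed.
        return pc + pc_offset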

Invoke
^^^^^^
**Arguments**:
::
  Index func_index

Invoke the function at ``func_index``, consuming the number of arguments given by the VMFunction's
arity field.

InvokeClosure
^^^^^^^^^^^^^
**Arguments**:
::
    RegName closure
    Index num_closure_args
    RegName* closure_args

Invokes ``closure``, consuming the number of arguments declared in the closure's VMFunction.
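
The relationship between ``AllocClosure`` and ``InvokeClosure`` can be sketched as follows: a closure pairs a function index with the values captured at allocation time, and invocation combines the explicit arguments with those captured values. The ``Closure`` class is hypothetical, and the argument order used here (explicit arguments first, captured free variables after) is an assumption made for the sketch.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Any, Tuple


    @dataclass(frozen=True)
    class Closure:
        func_index: int
        free_vars: Tuple[Any, ...]


    def alloc_closure(registers, dst, clo_index, num_freevar, free_vars):
        if len(free_vars) != num_freevar:
            raise ValueError("free_vars must contain num_freevar registers")
        # Capture the current values of the free-variable registers.
        registers[dst] = Closure(clo_index, tuple(registers[reg] for reg in free_vars))


    def invoke_closure(registers, functions, closure, closure_args):
        clo = registers[closure]
        args = [registers[reg] for reg in closure_args] + list(clo.free_vars)
        return functions[clo.func_index](*args)

Because the free variables are copied when the closure is allocated, later writes to the original registers do not affect the closure in this model.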

LoadConst
^^^^^^^^^
**Arguments**:
::
  RegName dst
  Index const_index

Load the constant at ``const_index`` from the constant pool. The result is saved to register ``dst``.

LoadConsti
^^^^^^^^^^
**Arguments**:
::
  Index val
  RegName dst

Load the constant integer ``val`` into register ``dst``. The result is a rank-0 tensor.

Object Representation
~~~~~~~~~~~~~~~~~~~~~
We leverage the object protocol to represent the objects that are used by the
VM.

Currently, three types of objects, ``NDArray``, ``ADT``, and ``Closure``, are used
to represent tensor, tuple/list, and closure data, respectively. More details
on them can be found in `include/tvm/runtime/ndarray.h`_,
`include/tvm/runtime/vm.h`_, and `include/tvm/runtime/container.h`_.

.. _include/tvm/runtime/ndarray.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/ndarray.h

.. _include/tvm/runtime/vm.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/vm.h

.. _include/tvm/runtime/container.h: https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/container.h

Stack and State
~~~~~~~~~~~~~~~

The Relay VM maintains a stack of frames, each of which contains information about how to resume the
previous call. Registers are allocated in a contiguous space (a virtual register file) for each function.

We keep track of the set of Relay functions we have called, a pointer into the current function's bytecode, and an offset into the bytecode (known as the program counter).

::

    struct VirtualMachine {
      ...
      std::vector<VMFrame> frames;
      ...
      // Current function.
      size_t func_index;
      // Pointer into the current function's instructions.
      const Instruction* code;
      // Current program counter relative to the code pointer.
      size_t pc;
      ...
    };
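
The following Python sketch shows, under simplifying assumptions, how such state could be saved on a call and restored on return. The ``Frame`` and ``VMState`` classes and their method names are hypothetical; in particular, storing the caller's whole register file in the frame is a choice made for brevity and is not taken from the implementation.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Any, List


    @dataclass
    class Frame:
        func_index: int       # function to resume
        pc: int               # program counter to resume at
        registers: List[Any]  # the caller's register file


    class VMState:
        def __init__(self, entry_func_index: int, num_registers: int) -> None:
            self.frames: List[Frame] = []
            self.func_index = entry_func_index
            self.pc = 0
            self.registers: List[Any] = [None] * num_registers

        def push_frame(self, callee_index: int, args: List[Any], num_registers: int) -> None:
            # Remember where to resume: the instruction after the call.
            self.frames.append(Frame(self.func_index, self.pc + 1, self.registers))
            self.func_index = callee_index
            self.pc = 0
            # Arguments occupy the first registers of the callee's register file.
            self.registers = list(args) + [None] * (num_registers - len(args))

        def pop_frame(self, dst: int, result: int) -> None:
            value = self.registers[result]
            frame = self.frames.pop()
            self.func_index = frame.func_index
            self.pc = frame.pc
            self.registers = frame.registers
            # Deliver the callee's result into the caller's destination register.
            self.registers[dst] = value

Here ``pop_frame`` plays the role of ``Ret``: it reads the callee's ``result`` register and writes it to the caller's ``dst`` register.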


Dispatch Loop
~~~~~~~~~~~~~
A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a
virtual machine, but we have experimentally found this not to be the case for Relay. We have implemented
a simple ``switch``/``goto`` dispatch loop which dispatches based on the instruction opcode.

This loop is implemented by ``VirtualMachine::Run()``.
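
For intuition, a dispatch loop of this style can be modeled in a few lines of Python. This is only a toy: the ``Op`` enum covers a handful of the instructions described above, ``ADD`` is a hypothetical stand-in for an operator call, and ``Ret`` is simplified to return a value directly instead of writing to a caller's register.

.. code-block:: python

    from enum import Enum, auto


    class Op(Enum):
        LOAD_CONSTI = auto()
        ADD = auto()  # hypothetical stand-in for an operator invocation
        IF = auto()
        GOTO = auto()
        RET = auto()


    def run(code, num_registers):
        registers = [None] * num_registers
        pc = 0
        while True:
            op, *args = code[pc]
            if op is Op.LOAD_CONSTI:
                val, dst = args
                registers[dst] = val
                pc += 1
            elif op is Op.ADD:
                dst, lhs, rhs = args
                registers[dst] = registers[lhs] + registers[rhs]
                pc += 1
            elif op is Op.IF:
                test, target, true_offset, false_offset = args
                pc += true_offset if registers[test] == registers[target] else false_offset
            elif op is Op.GOTO:
                (pc_offset,) = args
                pc += pc_offset
            elif op is Op.RET:
                (result,) = args
                return registers[result]
            else:
                raise RuntimeError(f"unknown opcode: {op}")


    # Count from 0 up to 3 using a backwards jump.
    program = [
        (Op.LOAD_CONSTI, 0, 0),  # r0 = 0 (counter)
        (Op.LOAD_CONSTI, 1, 1),  # r1 = 1 (step)
        (Op.LOAD_CONSTI, 3, 2),  # r2 = 3 (limit)
        (Op.IF, 0, 2, 3, 1),     # if r0 == r2, jump to RET; otherwise fall through
        (Op.ADD, 0, 0, 1),       # r0 = r0 + r1
        (Op.GOTO, -2),           # jump back to the If
        (Op.RET, 0),             # return r0
    ]

Each iteration decodes one instruction and updates the program counter, so the loop overhead is paid once per instruction rather than once per scalar operation.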

VM Compiler
~~~~~~~~~~~

An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode.
The VM compiler transforms a ``tvm::relay::Module`` into a ``tvm::relay::vm::Executable``. The executable
contains a set of compiled functions, each represented by a ``tvm::relay::vm::Function``. Each function contains metadata about the function as well as its compiled bytecode. The emitted executable object can then be loaded and run by a ``tvm::relay::vm::VirtualMachine`` object. For full definitions of the data structures, please see `include/tvm/runtime/vm.h`_.

Optimizations
~~~~~~~~~~~~~

There are quite a few optimizations required by the VM compiler. Each of them
is implemented as a pass which is managed by the Relay pass manager.

Optimizations marked with ``TODO`` are not implemented yet.

- A-Normal Form
- Lambda Lift (see `src/relay/backend/vm/lambda_lift.cc`_)
- Inline Primitives (see `src/relay/backend/vm/inline_primitives.cc`_)
- Constant Pool Layout (see `src/relay/backend/vm/compiler.cc`_)
- ADT Tag Allocation (see `src/relay/backend/vm/compiler.cc`_)
- Tail Call Optimization (TODO)
- Liveness Analysis (TODO)

.. _src/relay/backend/vm/lambda_lift.cc: https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/vm/lambda_lift.cc

.. _src/relay/backend/vm/inline_primitives.cc: https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/vm/inline_primitives.cc

.. _src/relay/backend/vm/compiler.cc: https://github.com/apache/incubator-tvm/blob/master/src/relay/backend/vm/compiler.cc

Serialization
~~~~~~~~~~~~~
Serializing and deserializing the executable generated by the Relay VM compiler is a must, as
we may want to save the model to disk and perform inference later. Previously, Relay produced
a serialized form in a JSON file for the graph runtime. However, the same format is not directly
applicable to the VM, as it emits bytecode instead of graph-style programs.
Serialization of an executable essentially needs to handle both model-specific
(i.e., weights and kernels) and VM-related (i.e., bytecode and global function names) data.

For kernels, we can conveniently leverage existing TVM infrastructure to save and load
the compiled library module. Here we focus only on serializing the other
components in a binary format that is organized into the following sections, in order.

- Global section. This section contains the globals (function names) used by the virtual machine.

- Constant section. This section is used to store the constant pool (i.e. weights of the model)
  for a virtual machine.

- Primitive name section. This section is introduced to accommodate the list of primitive
  operator names that will be invoked by the virtual machine, i.e. the names
  starting with ``fused_``. The primitive names are used as symbols to look up
  function pointers in the compiled kernel library.

- Code section. The VM functions, including bytecode, reside in this section. The dispatch
  loop iterates through this section to fetch instructions for execution.
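
The idea of writing a fixed sequence of sections into one binary blob can be illustrated with a toy encoder. The layout below (an 8-byte little-endian length prefix followed by the raw bytes of each section) is invented purely for this sketch and is not the actual on-disk format of the executable; ``SECTION_ORDER``, ``save_sections``, and ``load_sections`` are likewise hypothetical names.

.. code-block:: python

    import struct

    # Sections are written and read in this fixed order.
    SECTION_ORDER = ("globals", "constants", "primitive_names", "code")


    def save_sections(sections):
        """Concatenate the sections, each prefixed with its byte length."""
        out = bytearray()
        for name in SECTION_ORDER:
            payload = sections[name]
            out += struct.pack("<Q", len(payload))
            out += payload
        return bytes(out)


    def load_sections(blob):
        """Recover the sections by walking the blob in the same order."""
        sections = {}
        offset = 0
        for name in SECTION_ORDER:
            (length,) = struct.unpack_from("<Q", blob, offset)
            offset += 8
            sections[name] = blob[offset:offset + length]
            offset += length
        if offset != len(blob):
            raise ValueError("unexpected trailing bytes")
        return sections

Because the order is fixed, the reader needs no section names or tags in the blob itself; it only needs to agree with the writer on the sequence.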

Hence, unlike the graph runtime artifact, which contains the weights (.params), the graph JSON (.json),
and the compiled kernel library (.so), the serialized executable artifact is composed of the Relay
object file (.ro) and the compiled kernel library (.so).

A ``save`` function is implemented to serialize the executable into the above format
and store it on disk. Correspondingly, a ``load_exec`` function is used to
load the serialized kernel binary and the executable-related binary code, which are then used to
instantiate a VM object. Please refer to the `test_vm_serialization.py`_ file for more
examples.

.. _test_vm_serialization.py: https://github.com/apache/incubator-tvm/blob/master/tests/python/relay/test_vm_serialization.py

Unresolved Questions
~~~~~~~~~~~~~~~~~~~~

How do we handle dynamic shapes?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TODO

How can we modify the VM to support JIT compilation of certain code paths?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In the code generation space there are still many tradeoffs to be analyzed and the VM is designed
to be very flexible so we can modify it for future experiments.

How do we support heterogeneous execution?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Heterogeneous execution should work out of the box assuming we have annotated the appropriate device copies.
In order to do this properly we need to run the device annotation and copying passes.