Commit 0b54952b by Yida Wang, committed by Tianqi Chen

minor tweak of the runtime doc to fix some grammatical and expression issues (#828)

parent 80f9ca00
# TVM Runtime System
TVM supports multiple programming languages for compiler stack development and deployment.
In this note, we explain the key elements of the TVM runtime.

![](http://www.tvmlang.org/images/release/tvm_flexible.png)
We need to satisfy quite a few interesting requirements:
- Deployment: invoke the compiled function from the python/javascript/c++ language.
- Debug: define a function in python and call that from a compiled function.
- Link: write driver code to call device-specific code (e.g., CUDA) and call it from the compiled host function.
- Prototype: define an IR pass from python and call that from the C++ backend.
- Expose: expose the compiler stack developed in c++ to the front-end (i.e., python).
- Experiment: ship a compiled function to an embedded device and run it directly there.

We want to be able to define a function in any language and call it from another.
We also want the runtime core to be minimal to deploy to embedded devices.
## PackedFunc

[PackedFunc](https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/packed_func.h) is a simple but elegant solution we found to solve these challenges. The following code block provides an example in C++

```c++
#include <tvm/runtime/packed_func.h>

void MyAdd(TVMArgs args, TVMRetValue* rv) {
  // automatically convert arguments to the desired type.
  int a = args[0];
  int b = args[1];
  // automatically assign the return value to rv
  *rv = a + b;
}

void CallPacked() {
  PackedFunc myadd = PackedFunc(MyAdd);
  // get back 3
  int c = myadd(1, 2);
}
```
In the above code block, we defined a PackedFunc MyAdd. It takes two arguments: ```args``` represents the input arguments and ```rv``` represents the return value.
The function is type-erased, which means that the function signature does not restrict which input types to pass in or which type to return.
Under the hood, when we call a PackedFunc, it packs the input arguments into TVMArgs on the stack, and gets the result back via TVMRetValue.
Thanks to template tricks in C++, we can call a PackedFunc just like a normal function. Because of its type-erased nature, we can call a PackedFunc from dynamic languages like python, without additional glue code for each new type of function created.
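To make the packing mechanics concrete, here is a minimal sketch of the type-erasure idea in plain, dependency-free C++. The names (`MiniValue`, `MiniPackedFunc`, `CallPacked`, `MiniAdd`) are hypothetical stand-ins, not TVM's real classes; the point is only that every callee shares one signature while callers still write ordinary-looking calls.

```cpp
// A minimal sketch of the type-erasure idea behind PackedFunc.
// Hypothetical names -- this is NOT TVM's implementation.
#include <cassert>
#include <functional>
#include <vector>

// A tagged value playing the role of TVMValue plus its type code.
struct MiniValue {
  enum Tag { kInt, kFloat } tag;
  union { long long v_int; double v_float; };
  MiniValue(int v) : tag(kInt) { v_int = v; }
  MiniValue(double v) : tag(kFloat) { v_float = v; }
  operator int() const { assert(tag == kInt); return static_cast<int>(v_int); }
  operator double() const { assert(tag == kFloat); return v_float; }
};

// The type-erased function: every callee sees the same signature.
using MiniPackedFunc =
    std::function<void(const std::vector<MiniValue>& args, MiniValue* rv)>;

// Pack the call site's arguments, invoke, and unpack the result.
template <typename... Args>
MiniValue CallPacked(const MiniPackedFunc& f, Args... args) {
  std::vector<MiniValue> packed{MiniValue(args)...};
  MiniValue rv(0);
  f(packed, &rv);
  return rv;
}

// A callee in the style of MyAdd from the text.
void MiniAdd(const std::vector<MiniValue>& args, MiniValue* rv) {
  int a = args[0];   // converts via the tagged value, checking the tag
  int b = args[1];
  *rv = MiniValue(a + b);
}
```

With this, `CallPacked(MiniAdd, 1, 2)` yields 3; a binding for a dynamic language only needs to build the `MiniValue` vector, which is why no per-function glue code is required.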
The following example registers a PackedFunc in C++ and calls it from python.
```c++
// register a global packed function in c++
TVM_REGISTER_GLOBAL("myadd")
.set_body(MyAdd);
```
```python
import tvm

myadd = tvm.get_global_func("myadd")
# prints 3
print(myadd(1, 2))
```

Most of the magic of PackedFunc is in the TVMArgs and TVMRetValue structures. We restrict the list of possible types that can be passed; here are the common ones:
- int, float and string
- PackedFunc itself
- Module for compiled modules
- DLTensor* for tensor object exchange
- TVM Node to represent any object in IR

The restriction makes the implementation simple without the need of serialization.
Despite being minimum, the PackedFunc is sufficient for the use-case of deep learning deployment, as most functions only take DLTensor or numbers.
Since one PackedFunc can take another PackedFunc as an argument, we can pass functions from python (as PackedFunc) to C++.
```c++
TVM_REGISTER_GLOBAL("callhello")
.set_body([](TVMArgs args, TVMRetValue* rv) {
  PackedFunc f = args[0];
  f("hello world");
});
```
```python
import tvm

def callback(msg):
  print(msg)

# convert to PackedFunc
f = tvm.convert(callback)
callhello = tvm.get_global_func("callhello")
# prints hello world
callhello(f)
```
TVM provides a [minimum C API](https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/c_runtime_api.h), which allows us to embed the PackedFunc into any language. Besides python, so far we have supported
[java](https://github.com/dmlc/tvm/tree/master/jvm) and [javascript](https://github.com/dmlc/tvm/tree/master/web).
This philosophy of an embedded API is very much like Lua, except that we don't have a new language but use C++.
One fun fact about PackedFunc is that we use it for both the compiler and the deployment stack.
- All of TVM's compiler pass functions are exposed to the frontend as PackedFunc, see [here](https://github.com/dmlc/tvm/tree/master/src/api)
- The compiled module also returns the compiled function as PackedFunc
To keep the runtime minimal, we isolated the IR Node support from the deployment runtime. The resulting runtime takes around 200K - 600K depending on how many runtime driver modules (e.g., CUDA) get included.
The overhead of calling into a PackedFunc vs. a normal function is small, as it only saves a few values on the stack.
So it is OK as long as we don't wrap small functions.
In summary, the PackedFunc is the universal glue in TVM, and we use it extensively to support our compiler and deployment.
## Module
Since TVM supports multiple types of devices, we need to support different types of drivers.
We have to use the driver API to load the kernel, set up the arguments in packed format and perform the kernel launch.
We also need to patch up the driver API so that the exposed functions are threadsafe.
So we often need to implement these driver glues in C++ and expose them to the user.
We certainly cannot do it for each type of function, so again PackedFunc is our answer.
TVM defines the compiled object as [Module](https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/module.h). The user can get the compiled function from a Module as a PackedFunc. The ModuleNode is an abstract class that can be implemented by each type of device. This abstraction makes the introduction of a new device easy, and we do not need to redo the host code generation for each type of device.
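The shape of this abstraction can be sketched in plain C++. Everything below (`MiniModuleNode`, `HostModule`, `MiniFunc`) is a hypothetical stand-in for illustration, not TVM's real Module classes: one abstract interface hands out kernels as type-erased functions, and each backend hides its driver calls behind it.

```cpp
// A sketch of the Module idea: each device backend implements one abstract
// node that hands out its compiled kernels as type-erased functions.
// Hypothetical names -- not TVM's real classes.
#include <cassert>
#include <functional>
#include <map>
#include <string>

using MiniFunc = std::function<int(int, int)>;  // stand-in for PackedFunc

// Abstract interface every backend (CUDA, OpenCL, ...) would implement.
class MiniModuleNode {
 public:
  virtual ~MiniModuleNode() = default;
  // Look up a compiled kernel by name; driver load/launch hides behind this.
  virtual MiniFunc GetFunction(const std::string& name) = 0;
};

// A toy "backend" whose kernels are ordinary host functions.
class HostModule : public MiniModuleNode {
 public:
  HostModule() {
    table_["add"] = [](int a, int b) { return a + b; };
    table_["mul"] = [](int a, int b) { return a * b; };
  }
  MiniFunc GetFunction(const std::string& name) override {
    // A real module would cache the driver function handle on first lookup.
    return table_.at(name);
  }
 private:
  std::map<std::string, MiniFunc> table_;
};
```

A caller writes `HostModule m; m.GetFunction("add")(2, 3);` and never sees which driver API sits underneath, which is why adding a new device does not disturb the host code generation.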
## Remote Deployment
The PackedFunc and Module system also makes it easy to ship functions to remote devices directly.
Under the hood, we have an RPCModule that serializes the arguments, does the data movement, and launches the computation on the remote device.

![](http://www.tvmlang.org/images/release/tvm_rpc.png)

The RPC server itself is minimal and can be bundled into the runtime. We can start a minimal TVM RPC server on an iPhone/android/raspberry pi or even the browser. The cross compilation on the server and the shipping of the module for testing can be done in the same script. Check out the
[Cross compilation and RPC tutorial](http://docs.tvmlang.org/tutorials/deployment/cross_compilation_and_rpc.html#sphx-glr-tutorials-deployment-cross-compilation-and-rpc-py) for more details.
This instant feedback gives us a lot of advantages. For example, to test the correctness of generated code on the iPhone, we no longer have to write test-cases in swift/objective-c from scratch -- we can use RPC to execute on the iPhone, copy the result back and do verification on the host via numpy. We can also do the profiling using the same script.
## TVM Node and Compiler Stack
As we mentioned earlier, we build the compiler stack API on top of the PackedFunc runtime system.
We faced constant changes of the compiler API for the need of research: we need a new language object or IR node whenever we want to test out new primitives.
However, we don't want to change our API from time to time. Besides that, we also want to
- be able to serialize any language object and IRs
- be able to explore, print, and manipulate the IR objects in the front-end language to do quick prototyping
We introduced a base class, called [Node](https://github.com/dmlc/HalideIR/blob/master/src/tvm/node.h#L52), to solve this problem.
All the language objects in the compiler stack are subclasses of Node. Each node contains a string type_key that uniquely identifies the type of the object. We choose a string instead of an int as the type key so that new Node classes can be added in a decentralized fashion without adding the code back to the central repo. To speed up dispatching, we allocate an integer type_index at runtime for each type_key.
Since one Node object could usually be referenced in multiple places in the language, we use a shared_ptr to keep track of references. We use the NodeRef class to represent a reference to a Node.
We can roughly view the NodeRef class as a shared_ptr to the Node container.
We can also define a NodeRef subclass to hold each subtype of Node. Each Node class needs to define the VisitAttrs function.
```c++
class TensorNode : public Node {
 public:
  /*! \brief The shape of the tensor */
  Array<Expr> shape;
  /*! \brief data type in the content of the tensor */
  Type dtype;
  /*! \brief the source operation, can be None */
  Operation op;
  /*! \brief the output index from the source operation */
  int value_index{0};
  /*! \brief constructor */
  TensorNode() {}

  const char* type_key() const final {
    return "Tensor";
  }
  void VisitAttrs(AttrVisitor* v) final {
    v->Visit("shape", &shape);
    v->Visit("dtype", &dtype);
    v->Visit("op", &op);
    v->Visit("value_index", &value_index);
  }
};
```
In the above example, both ```Operation``` and ```Array<Expr>``` are NodeRefs.
The VisitAttrs function gives us a reflection API to visit each member of the object.
We can use this function to visit the node and serialize any language object recursively.
It also allows us to get members of an object easily in the front-end language.
For example, in the following code, we accessed the op field of the TensorNode.
```python
import tvm

x = tvm.placeholder((3,4), name="x")
# access the op field of TensorNode
print(x.op.name)
```
New Nodes can be added to C++ without changing the front-end runtime, making it easy to make extensions to the compiler stack.
Note that this is not the fastest way to expose members to the front-end language, but it might be one of the simplest approaches possible. We also find that it fits our purposes, as we mainly use python for testing and prototyping and still use c++ to do the heavy lifting.
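The reflection pattern behind VisitAttrs can be sketched in self-contained C++, independent of TVM. The names below (`MiniAttrVisitor`, `MiniNode`, `MiniTensorNode`, `IntFieldCollector`) are illustrative stand-ins: each node enumerates its members once, and any number of visitors (a serializer, a front-end getattr, a printer) reuse that single enumeration.

```cpp
// A plain-C++ sketch of the VisitAttrs reflection pattern.
// Hypothetical names -- not TVM's real Node/AttrVisitor classes.
#include <cassert>
#include <map>
#include <string>

// The visitor interface: one overload per supported member type.
class MiniAttrVisitor {
 public:
  virtual ~MiniAttrVisitor() = default;
  virtual void Visit(const char* key, int* value) = 0;
  virtual void Visit(const char* key, std::string* value) = 0;
};

class MiniNode {
 public:
  virtual ~MiniNode() = default;
  virtual const char* type_key() const = 0;
  virtual void VisitAttrs(MiniAttrVisitor* v) = 0;
};

// A node in the style of TensorNode, listing its members by name.
class MiniTensorNode : public MiniNode {
 public:
  std::string name{"x"};
  int value_index{0};
  const char* type_key() const override { return "Tensor"; }
  void VisitAttrs(MiniAttrVisitor* v) override {
    v->Visit("name", &name);
    v->Visit("value_index", &value_index);
  }
};

// One concrete visitor: collect the int members by name. A serializer or a
// front-end member getter would simply be other visitors over the same hook.
class IntFieldCollector : public MiniAttrVisitor {
 public:
  std::map<std::string, int> fields;
  void Visit(const char* key, int* value) override { fields[key] = *value; }
  void Visit(const char*, std::string*) override {}
};
```

Because the node only declares its members once, adding a new Node class automatically works with every existing visitor, which is the property that lets new Nodes land in C++ without touching the front-end runtime.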
## Implementation Details
Each argument in PackedFunc contains a union value [TVMValue](https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/c_runtime_api.h#L122) and a type code. This design allows a dynamically typed language to convert to the corresponding type directly, and a statically typed language to do runtime type checking during conversion.
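The union-plus-type-code encoding can be sketched as follows. The names (`MiniValueUnion`, `MiniArg`, `AsInt`, `MakeInt`) and the exact field layout are illustrative assumptions, not the real TVMValue definition; the point is how a statically typed binding checks the code before trusting the union.

```cpp
// Sketch of the union-plus-type-code argument encoding.
// Hypothetical names and layout -- not the real TVMValue.
#include <cassert>
#include <stdexcept>

union MiniValueUnion {
  long long v_int64;
  double v_float64;
  void* v_handle;  // e.g. a tensor or module pointer
};

enum class TypeCode { kInt = 0, kFloat = 1, kHandle = 2 };

// One packed argument: the raw bits plus the code that says how to read them.
struct MiniArg {
  MiniValueUnion value;
  TypeCode code;
};

MiniArg MakeInt(long long v) {
  MiniArg a;
  a.value.v_int64 = v;
  a.code = TypeCode::kInt;
  return a;
}

// What a statically typed binding does: runtime-check the code, then convert.
long long AsInt(const MiniArg& a) {
  if (a.code != TypeCode::kInt) throw std::runtime_error("expected int");
  return a.value.v_int64;
}
```

A dynamic-language binding would instead switch on `code` and build the matching native object directly, which is why both kinds of languages can share the same calling convention.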
The relevant files are
- [packed_func.h](https://github.com/dmlc/tvm/blob/master/include/tvm/runtime/packed_func.h) for the C++ API
- [c_runtime_api.cc](https://github.com/dmlc/tvm/blob/master/src/runtime/c_runtime_api.cc) for the C API and how to register global functions