[Relay] Add gradient operator tutorial docs (#2751)

* Add gradient operator tutorial docs
* Incorporate Steven's and Ziheng's feedback
* Remove TODO about `collapse_sum_like`
* Add more examples

Commit b9349cb0 authored Apr 12, 2019 by Logan Weber Committed by Jared Roesch Apr 12, 2019

Showing with 109 additions and 0 deletions

docs/dev/relay_add_op.rst
+104 -0

src/relay/pass/pattern_util.h
+5 -0

--- a/docs/dev/relay_add_op.rst
+++ b/docs/dev/relay_add_op.rst
@@ -156,6 +156,110 @@ before producing the call node:
        tup = Tuple(list(args))
        return _make.concat(tup)

+Gradient Operators
+------------------
+
+Gradient operators are important for writing differentiable programs in
+Relay. Although Relay's autodiff algorithm can differentiate first-class
+language constructs, operators are opaque. Because Relay can't look into
+an operator's implementation, an explicit differentiation rule must be
+provided for each operator.
+
+Both Python and C++ can be used to write gradient operators, but we focus our
+examples on Python, as it is more commonly used.
+
+Adding a Gradient in Python
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+A collection of Python gradient operators can be found in
+``python/tvm/relay/op/_tensor_grad.py``. We will walk through two
+representative examples: ``sigmoid`` and ``multiply``.
+
+.. code:: python
+
+    @register_gradient("sigmoid")
+    def sigmoid_grad(orig, grad):
+        """Returns [grad * sigmoid(x) * (1 - sigmoid(x))]."""
+        return [grad * orig * (ones_like(orig) - orig)]
+
+The inputs here are the original call to the operator, ``orig``, and a
+gradient ``grad`` to accumulate into. What we return is a list, where the
+element at index ``i`` is the derivative of the operator with respect to its
+``i``-th input. In general, the gradient function returns a list with as many
+elements as there are inputs to the base operator.
+
+Before we further analyze this definition, first we should recall the
+derivative of the sigmoid function: :math:`\frac{\partial \sigma}{\partial x}
+= \sigma(x)(1 - \sigma(x))`. The definition above looks similar to the
+mathematical definition, but there is one important addition, which we
+describe below.
+
+The term ``orig * (ones_like(orig) - orig)`` directly matches the derivative,
+because ``orig`` here is the sigmoid call itself, i.e. :math:`\sigma(x)`. But
+we're not just interested in the gradient of this one function. We want to
+compose this gradient with other gradients, so we can accumulate the
+gradient across an entire program. This is where the ``grad`` term comes in.
+In the expression ``grad * orig * (ones_like(orig) - orig)``, multiplying by
+``grad`` specifies how to compose the derivative with the gradient thus far.
+
+Now, we consider ``multiply``, a slightly more interesting example:
+
+.. code:: python
+
+    @register_gradient("multiply")
+    def multiply_grad(orig, grad):
+        """Returns [grad * y, grad * x]"""
+        x, y = orig.args
+        return [collapse_sum_like(grad * y, x),
+                collapse_sum_like(grad * x, y)]
+
+In this example, there are two elements in the returned list, because
+``multiply`` is a binary operator. Recall that if :math:`f(x, y) = xy`, the
+partial derivatives are :math:`\frac{\partial f}{\partial x} = y` and
+:math:`\frac{\partial f}{\partial y} = x`.
+
+There is one required step for ``multiply`` that is not required for
+``sigmoid``, because ``multiply`` has broadcasting semantics. Since the shape
+of ``grad`` might not match the shape of the inputs, we use
+``collapse_sum_like`` to sum each ``grad * <var>`` term over the broadcast
+axes, so that its shape matches the shape of the input we're differentiating
+with respect to.
+
+Adding a Gradient in C++
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Adding a gradient in C++ is similar to adding one in Python, but the
+interface for registering is slightly different.
+
+First, make sure ``src/relay/pass/pattern_util.h`` is included. It provides
+helper functions for creating nodes in the Relay AST. Then, define the
+gradient in a similar fashion as in the Python example:
+
+.. code:: c++
+
+    tvm::Array<Expr> MultiplyGrad(const Expr& orig_call, const Expr& output_grad) {
+        const Call& call = orig_call.Downcast<Call>();
+        return { CollapseSumLike(Multiply(output_grad, call.args[1]), call.args[0]),
+                 CollapseSumLike(Multiply(output_grad, call.args[0]), call.args[1]) };
+    }
+
+Notice that in C++ we can't use the same operator overloading that we have in
+Python, and we need to downcast, so the implementation is more verbose. Even
+so, we can easily verify that this definition mirrors the earlier example in
+Python.
+
+Now, instead of using a Python decorator, we need to tack a ``set_attr`` call
+for "FPrimalGradient" onto the end of the base operator's registration, in
+order to register the gradient.
+
+.. code:: c++
+
+    RELAY_REGISTER_OP("multiply")
+        // ...
+        // Set other attributes
+        // ...
+        .set_attr<FPrimalGradient>("FPrimalGradient", MultiplyGrad);
+
 Summary
 -------

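Note on the sigmoid rule in docs/dev/relay_add_op.rst: multiplying by ``grad`` is the chain rule, and that can be sanity-checked outside Relay. The sketch below is a plain-Python stand-in, not Relay code. It assumes scalar floats where the real ``orig`` and ``grad`` are Relay expressions, and ``sigmoid``, ``sigmoid_grad`` and ``numeric_derivative`` are hypothetical helpers introduced only for this illustration.

```python
import math


def sigmoid(x):
    """Plain scalar sigmoid, standing in for the Relay operator."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(out, grad):
    """Mirror of the registered rule: `out` plays the role of `orig`
    (the value sigmoid(x)), `grad` is the gradient flowing in from above.
    Returns one entry per operator input, so a one-element list."""
    return [grad * out * (1.0 - out)]


def numeric_derivative(f, x, eps=1e-6):
    """Central finite difference, used only as an independent check."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


# Compose sigmoid with an outer function: loss(x) = 3 * sigmoid(x).
# The gradient arriving at sigmoid from the outer function is then 3.
x = 0.5
upstream = 3.0
(analytic,) = sigmoid_grad(sigmoid(x), upstream)
numeric = numeric_derivative(lambda v: upstream * sigmoid(v), x)

assert abs(analytic - numeric) < 1e-6
```

The check passes only because the rule multiplies by the incoming gradient. Dropping ``grad`` would give the derivative of ``sigmoid`` alone, not of the composed function.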

--- a/src/relay/pass/pattern_util.h
+++ b/src/relay/pass/pattern_util.h
@@ -328,6 +328,11 @@ inline Expr OnesLike(Expr e) {
  return CallNode::make(op, {e});
 }

+inline Expr CollapseSumLike(Expr e) {
+  static const Op& op = Op::Get("collapse_sum_like");
+  return CallNode::make(op, {e});
+}
+
 inline Expr Power(Expr lhs, Expr rhs) {
  static const Op& op = Op::Get("power");
  return CallNode::make(op, {lhs, rhs}, Attrs(), {});
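
Note on ``collapse_sum_like``: the docs in this commit describe it as making the shape of a ``grad * <var>`` term match the input being differentiated. A rough NumPy sketch of that behaviour follows. It is an assumption-laden illustration, not the Relay operator: ``collapse_sum_like_np`` and ``multiply_grad_np`` are hypothetical names, and the sketch assumes the target shape broadcasts to the data shape under NumPy's rules.

```python
import numpy as np


def collapse_sum_like_np(data, like):
    """Sum `data` over broadcast axes until it has the shape of `like`.

    Hypothetical NumPy stand-in; assumes `like` broadcasts to `data`.
    """
    data = np.asarray(data, dtype=float)
    target = np.shape(like)
    # Broadcasting prepends axes on the left: sum those away entirely.
    extra = data.ndim - len(target)
    if extra > 0:
        data = data.sum(axis=tuple(range(extra)))
    # Axes of size 1 in `like` were stretched: sum them but keep the axis.
    stretched = tuple(
        i for i, n in enumerate(target) if n == 1 and data.shape[i] != 1
    )
    if stretched:
        data = data.sum(axis=stretched, keepdims=True)
    return data


def multiply_grad_np(x, y, grad):
    """Same structure as the documented rule: one gradient per input."""
    return [collapse_sum_like_np(grad * y, x),
            collapse_sum_like_np(grad * x, y)]


x = np.arange(6.0).reshape(2, 3)   # shape (2, 3)
y = np.array([10.0, 20.0, 30.0])   # shape (3,), broadcast across rows
grad = np.ones((2, 3))             # gradient of the (2, 3) output
dx, dy = multiply_grad_np(x, y, grad)
print(dx.shape, dy.shape)          # (2, 3) (3,)
```

Here ``y`` is used once per row of ``x``, so its gradient is the sum of the per-row contributions, which is why the second term is summed over axis 0 before being returned.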