Deep Learning, from a Programming Language Perspective

I bet you had heard of Artificial Neural Network (if not, go read it now), but specifically, what is back propagation and the intuition behind it? I will try to explain concept in Deep Learning from a Programming Language perspective, and point out what we can get from doing so. At the same time, I will also talk about the Deep Learning framework I developed, DeepDarkFantasy(DDF), to show how we can leverage concept from PL.

What is a neural network?

In one sentence, A Neural Network is just a program with unknown parameters! Those unknown parameters, are the weight of the neural network. Forward propagation is evaluation of the program. Deep Learning is the process of finding the best parameters.

For example, let’s say we have the following Neural Network(and assuming the activation function is ReLU):

Input 0 Output 0
Input 1 /\

The Scala code equivalent is:

def nn(in0 : Double)(in1 : Double)(w0 : Double)(w1 : Double) = max(in0 * w0 + in1 * w1, 0)

How to Train Your Neural Network

What should we do to find the best parameter?

Obviously, we need to define ‘best’. We can introduce a function that, given weights, use the weights to run the neural network, and give a score. The best weights maximize/minimize the score. The function is called Loss Function.

Assuming neural network have only a weight, a double:

type Weight = Double

LossFunction is all function that take a weight and return a double

type LossFunction = Weight => Double

A typical loss function is Mean Square Error: for every dimension of a multi-dimensional output by neural network and the expected output, square the difference. Divide the result by the number of dimension. For one-dimensional output, this is just the square of difference. We can write it down as a Scala function.

def MSE(l: Seq[Double])(r: Seq[Double]) = ( => p._1 * p._2).sum: Double) / l.length

This have different type then the LossFunction defined above. However, given a data set of the type Seq[(Input, Output)], we can turn it into a loss function: given weight, for every (input, output) on the data set, run the neural network with weight and input. Compare the output with the expected output. Sum all the absolute difference.

It is possible to transform a LossFunction, for example scaling it:

def Scale(d: Double)(lf: LossFunction): LossFunction = w => d * lf(w)

Switching between gradient descend or gradient ascend is Scale -1.

More transformation - Adding two LossFunction:

def Plus(l: LossFunction)(r: LossFunction): LossFunction = w => l(w) + r(w)

There are other LossFunction, for example L1 L2 regularization, for preventing large weights:

def L1: LossFunction = w => if (w > 0) w else -w
def L2: LossFunction = w => w * w

To add L2 onto a existing LossFunction, we might first scale L2 with a hyperparameter to adjust the regularization rate, and add he original loss.

Back to our main point: the most common method to minimize the Loss Function, is:

  1. Find some random value as initial weight
  2. Calculate the derivative for the Loss(with multiple weight this is a gradient)
  3. Reduce it if derivative is positive, increase it otherwise.(with multiple weight update to the backward direction of gradient)
  4. Repeat till you feel like it

This is called Gradient Descent, the backbone for a ton of training algorithm in Deep Learning. Back Propagation, is simply the process of finding derivative and updating.

Where are derivative from?

Sorting by the level of automation:

  1. Calculate and write them down manually. This is what PHD are for
  2. Manually write down the derivative of a single layer, and compose them. Caffe is like this: in Caffe, a module is a layer of Neural Network(NN), so we have Convolution Layer, Fully Connected Layer(called InnerProduct). Those layer needed to be written by programmers, but other can compose their own Neural Network out of those.
  3. Provide a Domain Specific Language(DSL), and implement NN layer/other NN architecture with it. If everything in DSL is differentiable, then NN implementation is Natrually differentiable. In this case, the Deep Learning had become a Programming Language! This is Tensorflow style.

Essentially the three are the same thing: we define a DSL, manually write down the derivative for primitive construct, and implement what we want with it.

In caffe style, the primitive is composable(although not very flexible) layers. In the more extreme case of 100% manual, there are only one primitive (the whole algorithm), without any compositionality. In this view, the largest divide in different framework/handcrafting is just the ‘size’ of basic operation. The smaller the primitive construct is, the more flexible, (ignoring efficiency) the easier to implement NN. The larger it is, the more efficient it is: optimizing small construct is hard.

At the same time, there are three way to represent derivative, sorting by flexibility(again, the more flexible isnt necessarily better)

  1. For a program in DSL, return a derivative calculation function, outside of the DSL. Caffe.
  2. For a program in DSL, return a program in DSL that calculate derivative, so it is possible to do further optimization, or get higher order derivative. Theano. DeepDarkFantasy.
  3. There is a function in DSL that do differentiation. StalinGrad.

What is the features of DDF?

Although DL framework is PL, they dont support much programming construct: there are basic operation on Double, some have condition/loop, some have scan/variable(see tensorflow whitepaper), and pretty much that’s it. DDF had add all sorts of ordinary PL construct, such as recursion, goto(continuation), exception, IO.

Comparing to other AD tool, DDF is a Typed Extensible Embedded DSL - AST of the DSL is richly-typed scala term, and it is very easy to extend the DSL in scala. We also extend AD to work on others type and term (and return term with correct type), and (not completed yet) give a formal proof for it’s correctness.

How does DDF work?

In order to find the derivative form for an arbitrary term, we just need to replace all occurence of Real with Dual Number. However, we quote the whole code after transformation, so we have an AST afterward. In another word, AD is symbolic.

As a consequence of this, DDF have abandoned the concept of a loss. After calculating the derivative for a term, we will not have a loss, but only a hybrid of the original term and the loss.

As an example, the gradient for “Either[Double, (Double, Double)] => Either[Double, Double]“ does not exists.

Instead, we have a hybrid by inserting the gradient in it:

Either[(Double, Double), ((Double, Double), (Double, Double))] => Either[(Double, Double), (Double, Double)]

It is simply the original type but with every occurrences of Double replaced with pair of it.

Note: It is infact possible to find the gradient for Sum type or Function, which is what early stage DDF do. I just abandon the approach since I do not like it

As for how to do Typed Extensible EDSL, see Finally Tagless

And for how to do Lambda Abstraction, see Compiling Combinator

Formal Definition

Maybe all those are too much Hand-waving, so I will give a formal definition of mini DDF, DDF-min, so it will help the type-theoretically inclined reader.

DDF-min is based on Call By Value, Simply Typed Lambda Calculus, with Real, Sum, Product, and Recursion under Y Combinator Schema Instead of using lambda abstraction, we represent stuff with SKI & BCKW Combinator instead.

Outside of the language, there’s a with_grad_t function, which traverse the type structure, and replace $\mathbb{R}$ with $\mathbb{R}\times\mathbb{R}$.

There is also a with_grad function, which traverse an AST, transforming it from AST of type t to type (with_grad_t t).

And there’s a Logical Relation which, for everything except a Function, have a trivial definition - everything satisfy, or for recursive case, what made it up also satisfy the logical relation.

For $f:A\rightarrow B$, beside the standard requirement(for every $x:A$ in the logical relation, $f\ x:B$ is also in), we also have: if $A \rightarrow B = \mathbb{R}\rightarrow \mathbb{R}$ (which is $A = \mathbb{R}\wedge B = \mathbb{R}$), the with_grad of this function, with some wrapper, return the Newtonian definition of derivative on one dimensional calculus of f.

Note: This is not completely the same with what MarisaKirisame/DDFADC describes.

Will Forward Mode AD cause efficiency trouble?

If you are familiar with AD, you will point out that, if there are N Double input, Forward Mode must activate N time. This is simply unacceptable. Luckily, this can be solved by generalizing Dual Number: It need not be of type $\text{Double} \times \text{Double}$. It can be $\text{Double}\times(\text{Double}\times\text{Double})$, and using the latter, we can calculate two derivative in one pass. For example, given x, y, z, and if we want to get the derivative of $(x + y) * z$ with respect to $x$ and $z$, we can write:

((x, (1, 0)) + (y, (0, 0))) * (z, (0, 1)) =
(x + y,(1, 0)) * (z, (0, 1)) =
((x + y) * z, (z, x + y))

In the above equation, the zeroth term of the pair is the value for the original expression. The first term of the pair is another pair, with 0, 1 term being derivative for x, y.

Or, we can replace Double * Double[1000] with Double * Double -> Double[1000], so we do not have to traverse the whole array when we multiply a term with a literal - we simply update the double passed in. This is Reversed Mode Automatic Differentiation.

In DDF Implementation, we achieve this by introducing a TypeClass, Gradient, which is satisfy by Unit, Double, Double * Double, Double => Double[1000], and such, with constraint very similar to a Field. We use it to define Dual Number of the type Double * Gradient, and we can write fundamental operation using construct on Field.

So what is the main benefit from it, from a user point of view?

We want to achieve the examples given by Neural Networks, Types, and Functional Programming

We can give higher order derivative for code involving loop/recursion. Tensorflow do not support such.

We also want to be able to write arbitrary normal algorithm/program, but with unknown weight, and use DDF to find the best value for the weight

And at last, here is an example of using gradient descent to solve $x^2+2x+3=27$.