Prerequisites
I wrote tinyml for learning purpose, and I found it is useful and interesting to figure out what is a deep learning framework doing under their apis. I hope this project can be useful for other students to understand how deep learning works.
In tinyml, we focus on three main tasks: construct the neural networks, perform the training, evaluating and visualizing processes and export the trained weight to the persistent storage (i.e. the hard disk). We have made the following modules to achieve these goals:
Convolutional Neural Network: Convolutional Neural Network (CNN) is a class of neural networks, and has been proven to be effective for most computer vision tasks. In a CNN architecture for image classification, there are usually three important components: the convolutional layer, the pooling layer and the fully connected layer. The first two types are designed to extract highlevel features from images, and the fully connected layer can be used as a classifier to output the classification results. In a convolutional neural network, the convolutional and fully connected layers are equipped with parameters called weight and bias, which will be updated during training. In this work, we implemented these components along with other necessary components, such as activations (ReLu function), losses (CrossEntropy Loss), etc.
Deconvolution: The two fundamental components of the CNN are convolutional layers and pooling layers, which works together to transform images into feature maps. Deconvolutional operation is a transformation that goes in the opposite direction of a normal convolution operation, i.e. from the feature maps that we extracted with convolution operation to something that has the shape of the input to it. After being introduced, deconvolution has been used in many fields such as pixelwise segmentation, generative models, etc. In this work, we use deconvolution to map the intermediate feature maps back to the input pixel space. With this approach, we can show the relationship between the input patterns and the activations in the feature maps. There are also two components in the approach, called deconvolutional layers and unpooling layers, and we will explain these two concepts in more detail in Section 3 Approach.
Stochastic Gradient Descent: In neural networks, we want to find a function $\hat{y}=F(x)$ such that $yF(x)$ is minimal. The function that we are looking for is usually nonlinear, nonconvex and there is generally no formula for it. As a result, gradient descent becomes one of the most popular methods to find the local minimum of the function. The method is based on a fact that the function $f$ decreases fastest along the direction of the negative gradient. Formally, we can define a function that measures the difference between $\hat{y}$ and $y$, for example $f=y\hat{y}$ and assume that $a$ is the only parameter in $f$. Then if we let $a_{n+1}=a_n\epsilon\nabla_af$ and we want to find the lowest value of $f(a)$ around the point $a$. Then if $\epsilon$ is small enough, we will have $f(a_{n+1})\leq f(a_n)$ and $f(a_{n+1})$ is the smallest value around a small enough interval of $a$. Considering this, if we want to find the local minimal of the function $f$, we can start at a random point $a_0$, and follows the negative direction of the gradient. With this approach, we will have a sequence $a_1, a_2,\cdots a_n$ that satisfies $a_{n+1}=a_n\epsilon\nabla_af$. Then the output of the function $f$ will satisfy the rule that $f(a_n)\leq f(a_{n1})\leq\cdots \leq f(a_{0})$. By doing so, we could find an approximate value $a_n$ such that $f(a_n)$ is the local minimal.
Backpropagation: In the process of gradient descent, we found that we need to compute the gradient of our function $f$ in every step. Backpropagation, as an application of the chain rule, is an efficient algorithm for calculating the gradient in deep neural networks. In short, it first computes the gradient of the loss function to the weight of the last layer in a neural network, and passes the gradient of the loss function to the input of the layer to previous layers. There are two bases for the algorithm:
 In the $i$th layer, we can receive the gradient of the loss $\ell$ with respects to the output of $i$th layer, i.e. $\frac{\partial \ell}{\partial \hat{y}^{(i)}}$ is known to us.
 Since the output of $(i1)$th layer is the input of the $i$th layer, we have $\frac{\partial \ell}{\partial x^{(i)}}=\frac{\partial \ell}{\partial \hat{y}^{i1}}$
Having these two bases, we could compute the gradient of the loss $\ell$ with respect to the weight and input of every layer by applying chain rules. For example, $\frac{\partial \ell}{\partial w^{(i)}}=\frac{\partial \ell}{\partial \hat{y}^{(i)}}\frac{\partial \hat{y}^{(i)}}{\partial w^{(i)}}$ where we only need to know how to compute $\hat{y}^{(i)}$ with $w^{(i)}$. With these, we could efficiently compute the loss value with respect to every parameter in the neural network and update them with the SGD method.
Numerical Differentiation Besides manually working out the derivatives, we can also estimate the derivatives with numerical approximation. numerical differentiation is an algorithm for estimating the derivative of a mathematical function using the values of the function.
The simplest method, also known as Newton’s differentiation quotient is by using the finite difference approximations. More specifically, if we have a function $f(x)$ and we want to compute the derivative of $f$, we can approximate it by computing the slope of a nearby secant line through the points $(x, f(x))$ and $(x+\epsilon, f(x+\epsilon))$. The slope of this secant line will be $\frac{f(x+\epsilon)f(x)}{\epsilon}$, and the derivative is the tangent line at the point $x$. As $\epsilon$ approaches $0$, the slope of the secant line approaches the slope of the tangent line. Therefore, the true derivative of $f$ at the point $x$ can be defined as $f’(x)=\lim_{\epsilon\to0}\frac{f(x+\epsilon)f(x)}{\epsilon}$. Then by this nature, we could manually choose a small enough $\epsilon$, and to approximately approach the tangent line, i.e. the derivative of the function $f$ at $x$.
As we know the point $x+\epsilon$ is at the right of $x$, the form $\frac{f(x+\epsilon)f(x)}{\epsilon}$ is called rightsided form. Besides this form, we can also approach the tangent line from the left side and right side (the twosided form) at the same time. To do so, we compute the slope of a nearby secant line through the points $(x\epsilon, f(x\epsilon))$ and $(x+\epsilon, f(x+\epsilon))$ by $\frac{f(x+\epsilon)f(x\epsilon)}{2\epsilon}$. This form is a more accurate approximation to the tangent line than the onesided form and therefore we will use the twosided form in the following sections.
Linear Transformation
As we usually represent the data (such as the input, output, other parameters, etc) in neural networks as matrices, and the most fundamental transformations in neural networks are linear, it is essential to understand how to compute the derivatives of a linear transformation of matrices by using the chain rule.
Since there is no such concept “Layer” in this process, we will use $X$ and $Y$ to denote the matrices and $x_{ij}$ and $y_{kl}$ to represent entries in matrices without the superscripts.
Assume that we have $f(Y):\mathbb{R}^{m\times n}\to\mathbb{R}$ and a linear tranformation $g(X):\mathbb{R}^{p\times n}\to \mathbb{R}^{m\times n}, Y=g(X)=AX+B$, where $A\in\mathbb{R}^{m\times p}, B\in\mathbb{R}^{m\times n}$. We can compute the derivatives of $f$ with respect to $X$ as the following:

We know, at the point $x$, if there are two intermediate variables $u=\phi(x)$ and $v=\psi(x)$ that have partial derivatives with respect to $x$ defined, then the composited function $f(u,v)$ has partial derivatives with respect to $x$ defined and can be computed as $\frac{\partial f}{\partial x}=\frac{\partial f}{\partial u}\frac{\partial u}{\partial x}+\frac{\partial f}{\partial v}\frac{\partial v}{\partial x}$. In our case, there might be several intermediate variables $y_{kl}$, hence we have $\frac{\partial f}{\partial x_{ij}}=\sum_{kl}\frac{\partial f}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}}$ (1).

Let $a_{ij}$ and $b_{ij}$ represent elements in $A$ and $B$, then $y_{kl}=\sum_{s}a_{ks}x_{sl}+b_{kl}$. Hence we have $\frac{\partial y_{kl}}{\partial x_{ij}}=\frac{\partial \sum_{s}a_{ks}x_{sl}}{\partial x_{ij}}=\frac{\partial a_{ki}x_{il}}{\partial x_{ij}}=a_{ki}\delta_{lj}$ (2). Here $\delta_{lj}$ is defined as $\delta_{lj}=1$ when $l=j$, otherwise $\delta_{lj}=0$. Intuitively, we know that for the single pair $(x_{ij}, y_{kl})$, there is a relation $y_{kl}=a_{ki}x_{il}+b_{kl}$. Hence the derivative of $y_{kl}$ with respect to $x_{ij}$ is $a_{ki}$ only when $l=j$, otherwise, the derivative will be $0$.

Take (2) into (1), we will have $\frac{\partial f}{\partial x_{ij}}=\sum_{kl}\frac{\partial f}{\partial y_{kl}}\frac{\partial y_{kl}}{\partial x_{ij}}=\sum_{kl}\frac{\partial f}{\partial y_{kl}}a_{ki}\delta_{lj}=\sum_{k}\frac{\partial f}{\partial y_{kj}}a_{ki}$ (because only $y_{kj}$ will be kept). In this equation, we know that $a_{ki}$ is the $i$th row of $A^T$ and $\frac{\partial f}{\partial y_{kj}}$ is the $(k,j)$ element in the gradient of $f$ with respect to $Y$. In summary, this equation tells us that the derivative of $f$ with respect to $x_{ij}$ is the dot product of the $i$th row of $A^T$ and the $j$th column of $\nabla_Yf$.

Now that we have already known that $\frac{\partial f}{\partial x_{ij}}$ is the dot product of the $i$th row of $A^T$ and the $j$th column of $\nabla_Yf$. Then for $\frac{\partial f}{\partial X}$, we have
$$ \frac{\partial f}{\partial X}=\left[ {\begin{array}{ccc} \frac{\partial f}{\partial x_{11}} & \cdots & \frac{\partial f}{\partial x_{1n}} \newline \vdots & \frac{\partial f}{\partial x_{ij}} & \vdots \newline \frac{\partial f}{\partial x_{p1}} & \cdots & \frac{\partial f}{\partial x_{pn}} \end{array} } \right]=A^T\nabla_Y f $$
because every element in $\frac{\partial f}{\partial X}$ equals to a inner product of a row and a column.

It is also common that $g(X)$ is defined as $g(X):\mathbb{R}^{m\times p}\to \mathbb{R}^{m\times n}, Y=g(X)=XC+D$ where $C\in\mathbb{R}^{p\times n}$ and $D\in\mathbb{R}^{m\times n}$. In this case, we have $f(Y):\mathbb{R}^{m\times n}\to\mathbb{R}$ defined as well. Then to compute $\frac{\partial f}{\partial X}$, we first consider $Y^T$ and we have $Y^T=(XC+D)^T=C^TX^T+D^T$. Then by the laws we have found above, we immediately know that $\nabla_{X^T}f=(C^T)^T\nabla_{Y^T} f=C\nabla_{Y^T} f$. Therefore, we have $\nabla_X f = (\nabla_{X^T}f)^T=(C\nabla_{Y^T} f)^T=(\nabla_{Y^T} f)^TC^T=(\nabla_{Y} f)C^T$
In summary, if we have two functions, $f(Y)$ takes a matrix and returns a scalar, and a linear tranformation function $g(X)$, then we can perform the differentiation using the chain rule. More specifically, we have found two laws:
 If the linear transformation is defined as $g(X)=AX+B$ (we call this left multiplcation), then we have $\nabla_X f=A^T\nabla_Y f$. (Law 1)
 If the linear transformation is defined as $g(X)=XC+D$ (we call this right multiplcation), then we have $\nabla_X f=(\nabla_Y f)C^T$. (Law 2)
Note: We should be careful about the independent variable. Here we are computing the gradient of the function $f$ with respect to $X$, we can also compute the gradient of the function $f$ with respect to $C$. In that case, we should use a different law.
Linear Layer
(also known as fully connected layer, dense layer, etc)
Definition The fully connected layer is the simplest and most fundamental neural network component as it connects all the nodes in a layer to all the nodes in the next layer. For example, in the below figure we present a simple fully connected layer that connects all nodes, in which yellow nodes represent the bias (the nodes that are independent of the output of the previous layer) in each layer, green nodes represent the input to the neural network, blue nodes represent the intermediate activities and red nodes represent the output results.
From the definition of fully connected layer, we know that in the $i$th layer, if there are $n^{(i)}$ input nodes (in our figure, $n^{(1)}=2$ (the green nodes), $n^{(2)}=2$ (the blue nodes)) and $m^{(i)}$ output nodes (in our figure, $m^{(0)}=2$ (the blue nodes), $m^{(1)}=2$ (the red nodes)), there will be $m^{(i)}\times n^{(i)}$ relations between them. These relations can be represented as a $m^{(i)}\times n^{(i)}$ matrix $w^{(i)}$ (weight). Besides of these relations, there will be $m^{(i)}$ biases that can be represented as a $m^{(i)}$d vector $b^{(i)}$ (bias). (in our figure, $b^{(0)}=2$ ($x_0$ to $h_1^{(1)}$ and $x_0$ to $h_2^{(1)}$), $b^{(1)}=2$ ($h_0^{(1)}$ to $\hat{y}_1$ and $h_0^{(1)}$ to $\hat{y}_2$))
Forward Pass With the above notations, we can directly compute the forward pass. For example, we can compute $h_1^{(1)}$ as $h_1^{(1)}=x_1w_{11}^{(1)} + x_2w^{(1)}_{21}+x_0$.
In the above equation, $w_{jk}^{(i)}$ is the relation between $x_j$ and $h^{(i)}_{k}$. In matrix form, we have $\hat{Y}=Xw+b$ where $w$ is the weight matrix, $X$ is the input vector and $b$ is the bias vector. In the matrix form, $w$ is a $n\times m$ matrix, $X$ is a $n$d vector and $b$ is a $m$d vector.
Backward Pass In the training process, we want to update the relations between the input nodes and the output nodes. More specifically, we want to know how the weight $w$ and bias $b$ will impact the loss value, i.e. we want to compute $\frac{\partial\ell}{\partial w}$ and $\frac{\partial\ell}{\partial b}$. In the backward pass, the $i$th layer will receive the gradient of the loss value with respect to the input of the $(i+1)$th layer. As the input of the $(i+1)$th layer is the output of the $i$th layer, we have $\frac{\partial\ell}{\partial \hat{y}_i}$ known indeed.
With these conditions, we can apply the laws we have to compute the gradients. We will have
 $\frac{\partial\ell}{\partial w}=X^T\nabla^{(i)}$ (Using the Law 1)
 $\frac{\partial\ell}{\partial b}=\frac{\partial\ell}{\partial \hat{y}^i}\frac{\partial \hat{y}^i}{\partial b}=\nabla^{(i)}$
With these results, we can update the weight and bias in the $i$th layer with $w^{new}=w^{old}\epsilon \nabla^{(i)} x^T$ and $b^{new}=b^{old}\epsilon \nabla^{(i)}$ where $\epsilon$ is a preset hyperparameter called learning rate. After updating these parameters, we need to pass the gradient of the loss with respect to the input of this layer to the previous layer. Formally, we also have $\frac{\partial\ell}{\partial x^{i}}=\nabla^{(i)} w^T$ (Using the Law 2).
Convolutional Layer
Definition In fully connected layers, the input is always a onedimensional vector. However, images are usually stored as a multidimensional matrix and have implicit spatial structures. For example, the eyes are always on top of the nose, etc. These properties are not well expressed using a fully connected layer. Hence we use the convolutional layers to preserve these properties. For example, assume we have a single channel input matrix $A_{3\times 3}$ and single filter matrix $B_{2\times 2}$, and
$$A=\left[ {\begin{array}{ccc} a_{11} & a_{12} & a_{13} \newline a_{21} & a_{22} & a_{23} \newline a_{31} & a_{32} & a_{33} \end{array} } \right], B=\left[ {\begin{array}{ccc} b_{11} & b_{12} \newline b_{21} & b_{22} \end{array} } \right]$$
Then, we slide the filter $B$ with a unit stride, i.e. move one column or one row at a time. If we use $a_{ij}$ and $b_{ij}$ to denote the element in $A$ and $B$ at the $(i,j)$ location. Then we can obtain the output of the convolutional layer with the following steps:
 At the beginning, the $2\times 2$ filter is placed at the upper left corner. At this time, we perform the dot product and we will have $y_{11}=a_{11}b_{11}+a_{12}b_{12}+a_{21}b_{21}+a_{22}b_{22}$.
 Then we slide the filter across the width for a unit stride, i.e. we move the slide to the upper right corner. At this time, we perform the dot product and we will have $y_{12}=a_{12}b_{11}+a_{13}b_{12}+a_{22}b_{21}+a_{23}b_{22}$.
 Then we found that there is no more values on the right side, so we start to slide the filter on the next row. At this time, we start at the bottom left corner and we can obtain that $y_{21}=a_{21}b_{11}+a_{22}b_{12}+a_{31}b_{21}+a_{32}b_{22}$.
 Then we again slide the filter to the right side, i.e. the bottom right corner and we obtain that $y_{22}=a_{22}b_{11}+a_{23}b_{12}+a_{32}b_{21}+a_{33}b_{22}$.
After these steps, we found that we get four outputs, and we can obtain the final output if we place the output values to corresponding locations, i.e. the value we computed at the upper left corner is placed at the upper left corner and so on so forth. Hence, in this example, we get a $(1,2,2)$ output matrix
$$\begin{aligned} C=\left[ {\begin{array}{cc} y_{11} & y_{12} \newline y_{21} & y_{22} \end{array} } \right] \end{aligned}$$
More generally, we can formalize the input, the filters and the output of a convolutional layer as below:
 The input is a $(N, C, H, W)$ tensor, where $N$ is the number of the input matrix in a single batch, $C$ is the channel of the matrix, $H$ and $W$ are the height and width of the input matrix. For example, $10$ coloured images with the size $(224, 224)$ can be represented as a $(10, 3, 224, 224)$ tensor.
 The filters is a $(K, C, H_f, W_f)$ tensor, where $K$ is the number of filters, $C$ is the channel of the filters and it will always be identical to the channel of the input matrix. $H_f$ and $W_f$ are the height and width of the filters.
 The output is a $(N, K, H_{out}, W_{out})$ tensor.
 The stride that we use to slide the filters are denoted as $S$.
With these notations, we can compute the output of a convolution layer with seven loops.
Though the convolution operation can be computed by the above algorithm, we can still use matrix multiplication to perform such computation as suggested by A guide to convolution arithmetic for deep learning. The benefits of using matrix multiplication are twofold:
 We have already gotten two laws for computing the differentiation of linear transformation. If we can define the convolution operation as $g(x)=AX+B$ (i.e. matrix multiplication), we could easily reuse the two laws and get the derivative of the loss value with respect to the filter and input.
 We are about to study how to compute the forward pass of Deconv operation and that operation can be easily defined with the matrix multiplication form of the convolution operation. We will see this in Section 3.2.1 Deconv.
Hence, in the below computation of the forward pass and backward pass of the convolution operation, we will show how to convert it into a matrix multiplication.
Forward Pass Recall that the input $X$, the filter $W$ and the expected output $Y$ we have in the above example.
$$X=\left[ {\begin{array}{ccc} a_{11} & a_{12} & a_{13} \newline a_{21} & a_{22} & a_{23} \newline a_{31} & a_{32} & a_{33} \end{array} } \right], W=\left[ {\begin{array}{cc} b_{11} & b_{12} \newline b_{21} & b_{22} \end{array} } \right], Y=\left[ {\begin{array}{cc} y_{11} & y_{12} \newline y_{21} & y_{22} \end{array} } \right]$$
If we unroll the input and the output into vectors from left to right, top to bottom, we can also represent the filters as a sparse matrix where the nonzero elements are the elements in the filters. For example, We can unroll the input and output in our case as $X^\star_{9\times 1}=[a_{11},a_{12},a_{13},a_{21},a_{22},a_{23},a_{31},a_{32},a_{33}]^T$ and $Y^\star_{4\times 1}=[y_{11},y_{12},y_{21} ,y_{22}]^T$. Then we will want a $4\times 9$ matrix $W^\star$ such that $Y^\star = W^\star X^\star$. From the direct computation of convolutional layers, we can transform the original filters $W$ into
$$\begin{aligned} W^*=\left[ {\begin{array}{ccccccccc} b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 & 0 \newline 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 \newline 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 \newline 0 & 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} \end{array} } \right] \end{aligned}$$
Then we can verify that
$$\begin{aligned} W^\star X^\star=\left[ {\begin{array}{ccccccccc} b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 & 0 \newline 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 & 0 & 0 \newline 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} & 0 \newline 0 & 0 & 0 & 0 & b_{11} & b_{12} & 0 & b_{21} & b_{22} \end{array} } \right] \times [a_{11},a_{12},a_{13}\cdots, a_{33}]^T \end{aligned}$$
is equal to $Y^\star$.
Backward Pass From the forward pass, we converted the filters into a sparse matrix $W^\star$ and we found that the convolution is also a linear operation, i.e. $Y^\star=W^\star X^\star$. Similar to the backward pass of fully connected layers, we can directly compute the gradient of the loss value with respect to the weight matrix as $\frac{\partial\ell}{\partial W^\star}=\nabla^{(i)} (X^\star)^T$ (Using the Law 2).
Hence, we can update the weight matrix in convolutional layers as $(W^\star)^{new}=(W^\star)^{old}\epsilon \nabla^{(i)}(X^\star)^T$ and it is the same as in fully connected layers.
Besides, we will need to compute the gradient that we want to pass to previous layers, i.e. the gradient of the loss value with respect to the input matrix. We will have
$\frac{\partial\ell}{\partial X^\star}=\frac{\partial\ell}{\partial Y^\star}\frac{\partial Y^\star}{\partial X^\star}=(W^\star)^T\nabla^{(i)}$ (Using the Law 1).