<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://luciusluo.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://luciusluo.github.io/" rel="alternate" type="text/html" /><updated>2026-05-04T10:08:02+00:00</updated><id>https://luciusluo.github.io/feed.xml</id><title type="html">🐧 Andy’s Blog</title><author><name>Andy Luo</name></author><entry><title type="html">Attention Mechanism and Transformer</title><link href="https://luciusluo.github.io/jekyll/update/2020/09/04/Attention-Mechanism-and-Transformer.html" rel="alternate" type="text/html" title="Attention Mechanism and Transformer" /><published>2020-09-04T21:51:14+00:00</published><updated>2020-09-04T21:51:14+00:00</updated><id>https://luciusluo.github.io/jekyll/update/2020/09/04/Attention-Mechanism-and-Transformer</id><content type="html" xml:base="https://luciusluo.github.io/jekyll/update/2020/09/04/Attention-Mechanism-and-Transformer.html"><![CDATA[<p>Hello World.</p>]]></content><author><name>Andy Luo</name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[Hello World.]]></summary></entry><entry><title type="html">From Backprop to BPTT</title><link href="https://luciusluo.github.io/jekyll/update/2020/06/27/From-Backprop-To-BPTT.html" rel="alternate" type="text/html" title="From Backprop to BPTT" /><published>2020-06-27T19:43:01+00:00</published><updated>2020-06-27T19:43:01+00:00</updated><id>https://luciusluo.github.io/jekyll/update/2020/06/27/From-Backprop-To-BPTT</id><content type="html" xml:base="https://luciusluo.github.io/jekyll/update/2020/06/27/From-Backprop-To-BPTT.html"><![CDATA[<!--
<style type='text/css'>
  h3{
    color: #2a7ae2;
  }
</style>
-->

<h3><a id="hint"></a>A Hint of History</h3>

<p>The <strong>Back-propagation</strong> algorithm is, without doubt, one of the most important and powerful mathematical tools used by a wide variety of machine learning models. Using the chain rule and partial derivatives, it computes the gradient of the cost function with respect to each element of the weight matrix at each layer of the neural network. This calculation tells us how quickly the cost changes when we change the weights and biases within our network.</p>

<p>First proposed by <a href="http://people.idsia.ch/~juergen/who-invented-backpropagation.html">Seppo Linnainmaa</a>
in 1970, back-propagtion was later introduced to train neural network in 1974 by <a href="http://www.werbos.com/">Paul Werbos</a> in his famous PhD <a href="https://www.wiley.com/en-us/The+Roots+of+Backpropagation%3A+From+Ordered+Derivatives+to+Neural+Networks+and+Political+Forecasting+-p-9780471598978">dissertation</a>. But the algorithm did not gain enough appreciation until <a href="http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf">a famous paper</a> in 1986 by <a href="https://en.wikipedia.org/wiki/David_Rumelhart">David Rumelhart</a>, <a href="https://www.cs.toronto.edu/~hinton/">Geoffrey Hinton</a>, and <a href="https://en.wikipedia.org/wiki/Ronald_J._Williams">Ronald Williams</a>, who achieved some breakthrough success in several supervised learning tasks.</p>

<p>Rumelhart et al.’s paper, which illustrates how back-propagation adjusts the weights of the network to minimize the error between the actual and desired output vectors, demonstrates the algorithm’s ability to create new features, work faster, and solve problems that were “insoluble” for earlier approaches, including the <a href="https://en.wikipedia.org/wiki/Perceptron">Perceptron</a>.</p>

<p>Not long after Rumelhart et al.’s paper came out, a series of variants of the Backprop model were invented, the most important of which are <strong>Back-propagation Through Time (BPTT)</strong>, introduced by <a href="https://gribblelab.org/compneuro2012/readings/Pearlmutter_1989_NeuralComputation.pdf">Pearlmutter</a> in 1989, <strong>Epochwise BPTT</strong>, <strong>Truncated BPTT (TrBPTT)</strong>, and <strong>Real-Time Recurrent Learning (RTRL)</strong>, well explained by Paul Werbos in another influential <a href="http://axon.cs.byu.edu/Dan/678/papers/Recurrent/Werbos.pdf">paper</a> and by Ronald Williams and David Zipser in their 1990 <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51.7567&amp;rep=rep1&amp;type=pdf">article</a>.</p>

<p>In this post, I will illustrate how the simple, original Backprop model evolved into the later BPTT and its variants, which in turn laid a solid foundation for another powerful model: <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">Long Short-Term Memory (LSTM)</a>. I will talk about LSTM in the next post.</p>

<ul>
  <li><a href="#hint">A Hint of History</a></li>
  <li><a href="#bp">How Does Backpropagation Work?</a>
    <ul>
      <li><a href="#train_bp">How to Train with Backpropagation?</a></li>
    </ul>
  </li>
  <li><a href="#bptt">How Does BPTT Work?</a></li>
  <li><a href="#e-tr">Epochwise and Truncated BPTT</a></li>
  <li><a href="#rtrl">Real Time Recurrent Learning</a></li>
</ul>

<p><br /></p>
<h3><a id="bp"></a>How Does Backprop Work?</h3>
<p style="text-align: center;"><img src="/assets/img/post_img/BP1.JPG" alt="" height="50%" width="50%" />
<br />
<em>Figure 1. A simple fully connected network. (<a href="https://sites.cs.ucsb.edu/~xyan/">Image Source</a>)</em></p>

<p>Let us start with some basic math notation. The diagram above is a fully connected neural network with \(L-1\) hidden layers and the output layer \(a^{(L)}\). The input vector \(X\) is an \(N\)-by-\(1\) vector. \(W^{(i)}\) is the matrix that multiplies the feedforward input at each layer \(i\), while \(b^{(i)}\) is a bias vector added at each hidden layer. \(a^{(i)}\) is an \(M\)-by-\(1\) vector storing the hidden neurons at layer \(i\). Finally, \(Y^{(k)}\) represents the desired value of a single output unit \(k\) inside the \(K\)-by-\(1\) label vector (from the training samples). Note that in many cases \(dim(a^{(i)})\) does not necessarily equal \(dim(X)\), and \(dim(a^{(i)})\) can vary across layers.</p>

<p>At each layer \(i\), the feedforward input (the input \(X\) at the first layer, or \(a^{(i-1)}\) afterwards) is multiplied by its corresponding matrix \(W^{(i)}\), and the bias vector \(b^{(i)}\) is added; we denote this result \(z^{(i)}\). This \(z^{(i)}\) is then passed into a non-linear activation function \(f\) (often <a href="https://en.wikipedia.org/wiki/Sigmoid_function">sigmoid</a> or <a href="https://en.wikipedia.org/wiki/Softmax_function">softmax</a>), which gives the values of the hidden vector \(a^{(i)}\), fed forward as input to the next layer. Let’s dig deeper into the math behind it.</p>

<p>At the \(1^{st}\) layer, we have:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp; z^{(1)} = W^{(1)}·X + b^{(1)}\\
  &amp; a^{(1)} = f(z^{(1)}) \\
\end{align*}
$$
</div>

<p>At the \(2^{nd}\) layer, we have:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp; z^{(2)} = W^{(2)}·a^{(1)} + b^{(2)}\\
  &amp; a^{(2)} = f(z^{(2)}) \\
  &amp; ...
\end{align*}
$$
</div>

<p>The equations are written similarly for the rest of the hidden layers. Then at the output layer, we have:</p>

<div style="text-align:center;">
$$
\begin{align*}
  &amp; z^{(L)} = W^{(L)}·a^{(L-1)} + b^{(L)}\\
  &amp; a^{(L)} = f(z^{(L)}) \\
\end{align*}
$$
</div>
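<p>The layer-wise equations above can be sketched in NumPy. This is a minimal sketch with hypothetical names (<code>forward</code>, <code>Ws</code>, <code>bs</code> are mine, not from any particular library), assuming a sigmoid activation at every layer:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Ws, bs):
    """Feed X through the network: z^(i) = W^(i)·a^(i-1) + b^(i), a^(i) = f(z^(i))."""
    a = X
    zs, activations = [], [X]   # keep z's and a's for the backward pass later
    for W, b in zip(Ws, bs):
        z = W @ a + b           # affine step at layer i
        a = sigmoid(z)          # non-linear activation
        zs.append(z)
        activations.append(a)
    return zs, activations
```

<p>Here <code>Ws[i]</code> and <code>bs[i]</code> play the roles of \(W^{(i+1)}\) and \(b^{(i+1)}\), and the last entry of <code>activations</code> is the output \(a^{(L)}\).</p>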

<p>Then, we would like to calculate the Total Error using the <strong>Mean Squared Error (MSE)</strong> function \(E\), summing the errors over all the output nodes (on why MSE is used, see <a href="https://en.wikipedia.org/wiki/Mean_squared_error#In_regression">here</a>),</p>

<div style="text-align:center;">
$$
\begin{align*}
  &amp; E = \frac{1}{2K} \sum_{k=1}^K (Y^{(k)} - a^{(L)}_{k})^2 + \frac{\lambda}{2} \sum_{l=1}^L \|W^{(l)}\|^2\\
\end{align*}
$$
</div>

<p>where the first term is the <strong>MSE</strong> and the second one an optional <a href="https://towardsdatascience.com/understanding-the-scaling-of-l²-regularization-in-the-context-of-neural-networks-e3d25f8b50db">regularization</a> term.</p>
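<p>In code, this total error can be computed directly from the definition. A minimal sketch (the name <code>total_error</code> and the argument order are my own choices):</p>

```python
import numpy as np

def total_error(Y, a_L, Ws, lam=0.0):
    """E = (1/2K) Σ_k (Y_k − a_k^(L))² + (λ/2) Σ_l ‖W^(l)‖²."""
    K = Y.shape[0]
    mse = np.sum((Y - a_L) ** 2) / (2 * K)                  # MSE term
    reg = (lam / 2.0) * sum(np.sum(W ** 2) for W in Ws)     # optional L2 term
    return mse + reg
```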

<p><br />
<strong>Important!</strong> Keep in mind that our ultimate goal is to calculate \(\frac{\delta E}{\delta W_{ij}^{l}}\), the derivative of the error function \(E\) with respect to an arbitrary element at row \(i\) and column \(j\) of an arbitrary matrix \(W^{(l)}\) at layer \(l\).</p>

<p>Denote the error of a single output unit \(k\) as:</p>
<div style="text-align:center;">
$$H_k = \frac{1}{2}(Y^{(k)} - a_k^{(L)})^2$$
</div>

<p>As a result, the Total Error can be re-written as:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp; E = \frac{1}{K} \sum_{k=1}^K H_k + \frac{\lambda}{2} \sum_{l=1}^L \|W^{(l)}\|^2\\
\end{align*}
$$
</div>

<p>and its derivative as:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp; \frac{\delta E}{\delta W_{ij}^{l}} = \frac{1}{K} \sum_{k=1}^K \frac{\delta H_k}{\delta W_{ij}^{l}} + {\lambda}W_{ij}^{l}\\
  \tag{1}
\end{align*}
$$
</div>

<p>The rest of the job is then to calculate \(\frac{\delta H_k}{\delta W_{ij}^{l}}\), which we can decompose as:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp;\frac{\delta H_k}{\delta W_{ij}^{l}} = \frac{\delta H_k}{\delta z_{i}^{l}} · \frac{\delta z_{i}^{l}}{\delta W_{ij}^{l}}
  \tag{2}
\end{align*}
$$
</div>

<p>and denote:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp; \delta_i^{(l)} = \frac{\delta H_k}{\delta z_{i}^{l}}
  \tag{3}
\end{align*}
$$
</div>

<p>Okay! Now let’s start with an example: computing the derivative of \(H_i\) w.r.t. the output unit \(i\) at the <strong>output layer</strong> \(L\) using the chain rule:</p>
<div style="text-align:center;">
$$
\begin{align*}
  \delta_i^{(L)} &amp;= \frac{\delta H_i}{\delta z_{i}^{L}} = \frac{\delta}{\delta z_{i}^{L}} \frac{1}{2}(Y^{(i)} - a_{i}^{(L)})^2 \\
  &amp; = -(Y^{(i)} - a_{i}^{(L)}) · \frac{\delta}{\delta z_{i}^{L}} a_{i}^{(L)} \\
  &amp; = -(Y^{(i)} - a_{i}^{(L)}) · f'(z_i^{(L)})
\end{align*}
$$
</div>

<p>And we can use \(\delta^{(L)}\) to denote the error vector containing all these single-output errors. <strong>Bear in mind</strong> that this \(\delta^{(L)}\) is the term that propagates backward into the network and helps us calculate the derivatives of \(H\) w.r.t. the hidden units!</p>
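<p>For a sigmoid activation, \(f'(z) = f(z)(1 - f(z))\), so \(\delta^{(L)}\) can be computed in one vectorized line. A minimal sketch (the function names are mine):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)        # f'(z) = f(z)(1 − f(z)) for the sigmoid

def output_delta(Y, a_L, z_L):
    """δ^(L) = −(Y − a^(L)) ⊙ f'(z^(L)), one entry per output unit."""
    return -(Y - a_L) * sigmoid_prime(z_L)
```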

<p>Now we can extend the calculation of \(\delta_i^{(l)}\) to the hidden layers. Because the error flow propagates backward (or leftward), we can write the equation for \(\delta_i^{(l)}\) in terms of the error from the next layer, \(\delta^{(l+1)}\). (This, in fact, is <a href="https://en.wikipedia.org/wiki/Dynamic_programming">Dynamic Programming</a>.)</p>
<div style="text-align:center;">
$$
\begin{align*}
  \delta_i^{(l)} &amp;= \frac{\delta H_i}{\delta z^{l+1}} \frac{\delta z^{l+1}}{\delta z_{i}^{l}} \\
  &amp; = (W^{(l+1)})^T_i · \delta^{(l+1)} · f'(z_i^{(l)})\\
  \tag{4}
\end{align*}
$$
</div>

<p>This equation might seem frightening at first sight. However, as the figure below shows, \(\delta_i^{(l)}\) is influenced by errors propagating backward from all the units in the next layer (indicated by the red arrows), so we have to take all of these units into account, denoted \(\delta^{(l+1)}\). The \(i^{th}\) row of the transpose of \(W^{(l+1)}\) holds the weights that these errors travel through.</p>

<p style="text-align: center;"><img src="/assets/img/post_img/BP2.JPG" alt="" height="50%" width="50%" /></p>

<p>Nice! Then very simply:</p>
<div style="text-align:center;">
$$
\begin{align*}
  \frac{\delta z_{i}^{l}}{\delta W_{ij}^{l}} = a_j^{(l-1)}
  \tag{5}
\end{align*}
$$
</div>

<p>Plugging eqs. \((4)\) and \((5)\) back into \((2)\) gives us the following:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp;\frac{\delta H_k}{\delta W_{ij}^{l}} = (W^{(l+1)})^T_i · \delta^{(l+1)} · f'(z_i^{(l)}) · a_j^{(l-1)}
  \tag{6}
\end{align*}
$$
</div>

<p>Finally, if we plug eq. \((6)\) back into \((1)\), we obtain the final expression for \(\frac{\delta E}{\delta W_{ij}^{l}}\), which <strong>gradient descent</strong> uses to update the weights at each training step:</p>
<div style="text-align:center;">
$$
\begin{align*}
  &amp; W_{ij}^{l} = W_{ij}^{l} - \eta · \frac{\delta E}{\delta W_{ij}^{l}}
  \tag{7}
\end{align*}
$$
</div>
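<p>Putting eqs. \((4)\)–\((7)\) together, one full backward pass can be sketched as follows. This is a minimal sketch under the same sigmoid assumption as before; <code>backprop_update</code> and its arguments are hypothetical names, and the \(\frac{1}{K}\)-scaled, optionally regularized gradient follows eq. \((1)\):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_update(X, Y, Ws, bs, eta=0.1, lam=0.0):
    # forward pass, storing z's and activations for the backward pass
    a = X
    zs, acts = [], [X]
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)
    K = Y.shape[0]
    # output-layer error: δ^(L) = −(Y − a^(L)) ⊙ f'(z^(L))
    delta = -(Y - acts[-1]) * sigmoid_prime(zs[-1])
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        # eq. (6) plus the 1/K and λW terms of eq. (1); bias gradients omitted for brevity
        grads[l] = (delta @ acts[l].T) / K + lam * Ws[l]
        if l > 0:
            # eq. (4): push the error one layer back
            delta = (Ws[l].T @ delta) * sigmoid_prime(zs[l - 1])
    # eq. (7): gradient-descent step W ← W − η ∂E/∂W
    new_Ws = [W - eta * g for W, g in zip(Ws, grads)]
    return new_Ws, grads
```

<p>One sanity check on such a sketch is that a single step should reduce the error on the sample it was computed from.</p>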
<h3><a id="train_bp"></a>How to Train with Backpropagation?</h3>
<p>We will use this wonderful GIF below to explain how to train a network using Back-propagation.</p>
<p style="text-align: center;"><img src="/assets/img/post_img/BP3.gif" alt="" height="80%" width="80%" />
<br />
<em>Figure 3. Forward and Backward Pass of Back-propagation (<a href="https://machinelearningknowledge.ai/animated-explanation-of-feed-forward-neural-network-architecture/">Image Source</a>)</em></p>

<p>For each training sample in the dataset, say \(T_0\), we feed the input vector into the network to perform a forward pass. After obtaining the predicted output, we compute the error (loss) function and perform a backward pass to update each weight in the matrix at each layer. Once all weights are updated, we move on to the next training step and feed in the next sample \(T_1\). The process repeats until we exhaust all training samples.</p>
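<p>The per-sample cycle above can be sketched end-to-end on a toy problem. This is a minimal sketch, not the network from the figure: a single sigmoid unit trained sample-by-sample with the update rule of eq. \((7)\) (all names and the toy data are mine):</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_per_sample(samples, epochs=200, eta=1.0):
    """Forward pass, backward pass, weight update — once per sample, repeated."""
    W = np.zeros((1, 2))
    b = np.zeros((1, 1))
    for _ in range(epochs):
        for X, Y in samples:
            a = sigmoid(W @ X + b)               # forward pass
            delta = -(Y - a) * a * (1.0 - a)     # output-layer δ
            W -= eta * (delta @ X.T)             # eq. (7) for W
            b -= eta * delta                     # eq. (7) for b
    return W, b
```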

<p>However, the training procedure of Backpropagation suffers from a major drawback termed <a href="https://www.researchgate.net/publication/222068807_Avoiding_catastrophic_forgetting_by_coupling_two_reverberating_neural_networks">catastrophic forgetting</a>: when a network that has already learned the first training sample is then trained on a second sample, the new weight update may entirely erase the previously learned information. This gives rise to our next model, <strong>Backpropagation Through Time</strong>, which addresses this shortcoming of Backprop.
<br /></p>

<h3><a id="bptt"></a>How Does BPTT Work?</h3>
<br />

<h3><a id="e-tr"></a>Epochwise and Truncated BPTT</h3>
<br />

<h3><a id="rtrl"></a>Real Time Recurrent Learning</h3>
]]></content><author><name>Andy Luo</name></author><category term="jekyll" /><category term="update" /><summary type="html"><![CDATA[]]></summary></entry></feed>