NBIS
3/23/22
A single neuron has \(n\) inputs \(x_i\) and an output \(y\). To each input is associated a weight \(w_i\).
The activity rule is given by two steps:
\[a = \sum_{i=0}^{n} w_ix_i\]
\[\begin{array}{ccc} \mathrm{activation} & & \mathrm{activity}\\ a & \rightarrow & y(a) \end{array}\]
\[a = w_0 + \sum_{i=1}^{n} w_ix_i\]
\[y = y(a) = g\left( w_0 + \sum_{i=1}^{n} w_ix_i \right)\]
or in vector notation
\[y = g\left(w_0 + \mathbf{X^T} \mathbf{W} \right)\]
where:
\[\quad\mathbf{X}= \begin{bmatrix}x_1\\ \vdots \\ x_n\end{bmatrix}, \quad \mathbf{W}=\begin{bmatrix}w_1\\ \vdots \\ w_n\end{bmatrix}\]
Vectorized versions: input \(\boldsymbol{x}\), weights \(\boldsymbol{w}\), output \(\boldsymbol{y}\)
\[a = \boldsymbol{w}^{\mathsf{T}}\boldsymbol{x}\]
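A minimal numpy sketch of this forward pass, assuming \(g = \mathsf{tanh}\) as the activation and made-up numbers for the inputs and weights:
import numpy as np

def neuron_forward(x, w, w0, g=np.tanh):
    # activation: a = w0 + sum_i w_i x_i
    a = w0 + np.dot(w, x)
    # activity: y = g(a)
    return g(a)

# illustrative numbers for n = 3 inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_forward(x, w, w0=0.3))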
one to one: Image classification
many to one: Sentiment analysis
one to many: Image captioning
many to many: Machine translation
Assume multiple time points.
“dog bites man” vs “man bites dog”
Folded representation
Unfolded representation
Add a hidden state \(h\) that introduces a dependency on the previous step:
\[ \hat{Y}_t = f(X_t, h_{t-1}) \]
RNNs have what one could call “sequential memory” (Phi, 2020)
Exercise: say the alphabet in your head
A B C … X Y Z
Modification: start from e.g. letter F
May take time to get started, but from there on it’s easy
Now read the alphabet in reverse:
Z Y X … C B A
Memory access is associative and context-dependent
Add recurrence relation where current hidden cell state \(h_t\) depends on input \(x_t\) and previous hidden state \(h_{t-1}\) via a function \(f_W\) that defines the network parameters (weights):
\[ h_t = f_\mathbf{W}(x_t, h_{t-1}) \]
Note that the same function and weights are used across all time steps!
import numpy as np

class RNN:
    # ...
    # Description of forward pass
    # (self.h and the weight matrices self.W_hh, self.W_xh, self.W_hy
    #  are assumed to be initialized elsewhere, e.g. in __init__)
    def step(self, x):
        # update the hidden state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector
        y = np.dot(self.W_hy, self.h)
        return y

# illustrative usage: feed a sequence of word vectors one step at a time
rnn = RNN()
ff = FeedForwardNN()
for word in input:
    output = rnn.step(word)
    prediction = ff(output)
\[ h_t = \mathsf{tanh}(\mathbf{W_{xh}^T}X_t + \mathbf{W_{hh}^T}h_{t-1}) \]
\[ \hat{Y}_t = \mathbf{W_{hy}^T}h_t \]
Input \(X_t\), hidden state \(h_t\), output \(\hat{Y}_t\).
Note: \(\mathbf{W_{xh}}\), \(\mathbf{W_{hh}}\), and \(\mathbf{W_{hy}}\) are shared across all cells!
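As a hedged illustration of that weight sharing, the sketch below unrolls the equations over a toy sequence with random placeholder weights (following the convention of the step() code above, where the stored matrices multiply from the left without an explicit transpose):
import numpy as np

# toy sizes and random placeholder weights (not trained)
n_in, n_hid, n_out, T = 4, 3, 4, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))

h = np.zeros(n_hid)                     # h_0
X = rng.normal(size=(T, n_in))          # toy input sequence X_1 .. X_T
for x_t in X:
    # the same three weight matrices are reused at every time step
    h = np.tanh(W_xh @ x_t + W_hh @ h)  # h_t
    y_t = W_hy @ h                      # Y_t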
1. Not all inputs are of equal length
2. Long-term dependencies: “I grew up in England, and … I speak fluent English”
3. Order matters: “dog bites man” != “man bites dog”
Addresses points 2 and 3.
from keras.models import Sequential
from keras.layers import Dense, SimpleRNN

time_steps = 3  # length of each input window (example value)
model = Sequential()
model.add(SimpleRNN(units=3, input_shape=(time_steps, 1),
                    activation="tanh"))
model.add(Dense(units=1, activation="tanh"))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn (SimpleRNN) (None, 3) 15
dense (Dense) (None, 1) 4
=================================================================
Total params: 19
Trainable params: 19
Non-trainable params: 0
_________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn (SimpleRNN) (None, 3) 15
dense (Dense) (None, 1) 4
=================================================================
Total params: 19
Trainable params: 19
Non-trainable params: 0
_________________________________________________________________
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
simple_rnn (SimpleRNN) (None, 3) 15
dense (Dense) (None, 1) 4
=================================================================
Total params: 19
Trainable params: 19
Non-trainable params: 0
_________________________________________________________________
NB! In Keras, RNN input is a 3D tensor with shape [batch, timesteps, feature].
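As an illustration of that shape requirement, a sketch of how a univariate series could be windowed into [batch, timesteps, feature] form before being passed to the model above (the series and window length are made up):
import numpy as np

series = np.arange(20, dtype="float32")    # toy univariate series
time_steps = 3                             # must match input_shape in the model above
# sliding windows of length time_steps, each predicting the next value
X = np.array([series[i:i + time_steps] for i in range(len(series) - time_steps)])
y = series[time_steps:]
X = X.reshape((X.shape[0], time_steps, 1)) # -> [batch, timesteps, feature]
print(X.shape, y.shape)                    # (17, 3, 1) (17,)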
Example network trained on “hello” showing activations in the forward pass given input “hell”. The output layer contains the network's confidence for each character in the vocabulary ({h, e, l, o}). We want the blue numbers to be high and the red numbers low. P(e) is in the context of “h”, P(l) in the context of “he”, and so on.
What is the topology of the network?
4 input units (features), 4 time steps, 3 hidden units, 4 output units
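A hedged numpy sketch of that forward pass with this topology, using random weights instead of the trained ones (so the actual confidence values will differ from the figure):
import numpy as np

vocab = ["h", "e", "l", "o"]                      # 4 input/output units
one_hot = np.eye(len(vocab))
rng = np.random.default_rng(1)
W_xh = rng.normal(scale=0.5, size=(3, 4))         # 3 hidden units
W_hh = rng.normal(scale=0.5, size=(3, 3))
W_hy = rng.normal(scale=0.5, size=(4, 3))

h = np.zeros(3)
for ch in "hell":                                 # 4 time steps
    x = one_hot[vocab.index(ch)]
    h = np.tanh(W_xh @ x + W_hh @ h)
    scores = W_hy @ h                             # confidences over {h, e, l, o}
    probs = np.exp(scores) / np.exp(scores).sum() # softmax: P(next char | context so far)
    print(ch, "->", dict(zip(vocab, np.round(probs, 2))))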
See if you can improve the airline passenger model. Some things to try:
Errors are propagated backwards in time, from the final time step \(t\) back to \(t=0\).
Problem: calculating the gradient may depend on large powers of \(\mathbf{W_{hh}}^{\mathsf{T}}\) (e.g. \(\partial\mathcal{L} / \partial h_0 \sim f\left((\mathbf{W_{hh}}^{\mathsf{T}})^t\right)\))
At layer \(i\) of the unrolled network, the gradient scales as \((\mathbf{W_{hh}}^{\mathsf{T}})^{t-i}\)
\(\downarrow\)
Weight adjustments depend on size of gradient
\(\downarrow\)
Early layers tend to “see” small gradients and do very little updating
\(\downarrow\)
Parameters become biased towards learning from recent events
\(\downarrow\)
RNNs suffer from short-term memory
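The effect can be made concrete with a small numpy sketch (assumed small random recurrent weights, purely illustrative): the gradient picks up one factor of \(\mathbf{W_{hh}}^{\mathsf{T}}\) per step back in time and its norm typically shrinks geometrically.
import numpy as np

rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(3, 3))     # toy recurrent weights with small entries
grad = np.ones(3)                             # gradient arriving at the last time step
for t in range(1, 21):
    grad = W_hh.T @ grad                      # one factor of W_hh^T per step back in time
    if t % 5 == 0:                            # (the tanh' factor, <= 1, would shrink it further)
        print(f"after {t} back-steps: |grad| = {np.linalg.norm(grad):.2e}")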
ReLU (or leaky ReLU) instead of sigmoid or tanh.
Prevents small gradients: for \(x>0\) the gradient is a positive constant
Derivatives of \(\sigma\), \(\mathsf{tanh}\) and \(\mathsf{ReLU}\) activation functions.
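For reference, those derivatives can be evaluated directly; a short sketch (sample points chosen arbitrarily):
import numpy as np

x = np.array([-2.0, 0.5, 2.0, 5.0])
s = 1 / (1 + np.exp(-x))
d_sigmoid = s * (1 - s)              # at most 0.25, tiny for large |x|
d_tanh = 1 - np.tanh(x) ** 2         # at most 1, tiny for large |x|
d_relu = (x > 0).astype(float)       # exactly 1 for every x > 0
print(d_sigmoid.round(3))
print(d_tanh.round(3))
print(d_relu)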
Initialize the bias to 0 and the recurrent weights to the identity matrix
For example the LSTM. The idea is to control what information is retained within each RNN unit.
The gates make use of element-wise multiplication (×) and addition (+).
LSTM
GRU
Long Short Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014) architectures were proposed to solve the vanishing gradient problem.
In this paper, we propose a novel neural network model called RNN Encoder-Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder-Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Cho et al., 2014)
Remember the important parts, pay less attention to (forget) the rest.
LSTM adds cell state that in effect provides the long-term memory
Information flows in the cell state from \(c_{t-1}\) to \(c_t\).
Gates affect the amount of information let through. The sigmoid layer outputs anything from 0 (nothing) to 1 (everything).
In our preliminary experiments, we found that it is crucial to use this new unit with gating units. We were not able to get meaningful result with an oft-used tanh unit without any gating.
Forget gate purpose: reset the content of the cell state
Input gate purpose: decide when to read data into the cell state
Output gate purpose: read entries from the cell state
Purpose: decide what information to keep or throw away
Sigmoid squishes vector \([\boldsymbol{h_{t-1}}, \boldsymbol{x_t}]\) (previous hidden state + input) to \((0, 1)\) for each value in cell state \(c_{t-1}\), where 0 means “forget entry”, 1 “keep it”
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]
Two steps to adding new information:
1. the input gate \(i_t\) decides which values to update
2. a \(\mathsf{tanh}\) layer creates a vector of candidate values \(\tilde{c}_t\)
\[ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c) \]
\[ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t \]
Output is filtered version of cell state.
\[ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]
\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)\\ i_t = \sigma (W_i \cdot [h_{t-1}, x_t] + b_i)\\ \tilde{c}_t = \mathsf{tanh}(W_c \cdot [h_{t-1}, x_t] + b_c)\\ c_t = f_t * c_{t-1} + i_t * \tilde{c}_t\\ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)\\ h_t = o_t * \mathsf{tanh}(c_t) \]
\[ x_t \in \mathbb{R}^{n\times d}, \quad h_{t-1} \in \mathbb{R}^{n \times h}, \quad i_t \in \mathbb{R}^{n\times h}, \quad f_t \in \mathbb{R}^{n\times h}, \quad o_t \in \mathbb{R}^{n\times h} \]
and
\[ W_f, W_i, W_o, W_c \in \mathbb{R}^{h \times (h+d)} \]
where \(n\) is the batch size, \(d\) the input size, and \(h\) the number of hidden units; each weight matrix acts on the concatenation \([h_{t-1}, x_t]\).
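Putting the equations and dimensions together, a single LSTM step could be sketched in numpy as below (random placeholder weights, a single example rather than a batch of \(n\)):
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

d, h_dim = 4, 3                                  # input size d, hidden size h (illustrative)
rng = np.random.default_rng(0)
# one weight matrix and bias per gate, each acting on the concatenation [h_{t-1}, x_t]
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(h_dim, h_dim + d)) for _ in range(4))
b_f, b_i, b_o, b_c = (np.zeros(h_dim) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # input gate
    c_tilde = np.tanh(W_c @ z + b_c)             # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde           # new cell state
    o_t = sigmoid(W_o @ z + b_o)                 # output gate
    h_t = o_t * np.tanh(c_t)                     # new hidden state (filtered cell state)
    return h_t, c_t

h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(h_dim), np.zeros(h_dim))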
Modify the airline passenger model to use an LSTM and compare the results. Try out different parameters to improve test predictions.
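A possible starting point for the exercise, assuming the same time_steps window as in the earlier SimpleRNN model; units=4 is just an initial value to experiment with:
from keras.models import Sequential
from keras.layers import Dense, LSTM

model = Sequential()
model.add(LSTM(units=4, input_shape=(time_steps, 1), activation="tanh"))
model.add(Dense(units=1))
model.compile(loss="mean_squared_error", optimizer="adam")
model.summary()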
LSTM with Variable Length Input Sequences to One Character Output
Predict next character in sequence of strings
return_sequences=True
(Brownlee, 2017)