Recurrent Network

Recurrent Neural Network¶

Sequence Modeling¶

We need to model sequences.
Example: Language Modeling

Language Modeling¶

We need to model a probability distribution: given the previous \(i-1\) words, model the probability distribution of the \(i\)-th word.

What If MLP¶

Concatenate all the word vectors and feed them into an MLP whose output size is the vocabulary size, then sample the next word from that output distribution.
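A minimal NumPy sketch of this MLP language model may make the idea concrete. The sizes, the single tanh hidden layer, the random weights, and the names `E`, `W1`, `W2`, and `next_word_distribution` are all illustrative assumptions, not something the notes specify.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes chosen only for illustration.
vocab_size, embed_dim, context_len, hidden_dim = 10, 4, 3, 8

E = rng.normal(size=(vocab_size, embed_dim))                    # word vectors
W1 = rng.normal(size=(context_len * embed_dim, hidden_dim)) * 0.1
W2 = rng.normal(size=(hidden_dim, vocab_size)) * 0.1

def next_word_distribution(context_ids):
    """Return P(next word | fixed-length context) from a one-hidden-layer MLP."""
    x = E[context_ids].reshape(-1)   # concatenate the context word vectors
    h = np.tanh(x @ W1)              # hidden layer
    logits = h @ W2                  # one score per vocabulary entry
    logits -= logits.max()           # stabilize the softmax numerically
    p = np.exp(logits)
    return p / p.sum()

probs = next_word_distribution([1, 5, 2])
next_id = rng.choice(vocab_size, p=probs)   # sample the next word id
```

Note that `W1` has `context_len * embed_dim` rows, so the first layer grows with the context length and the model cannot see anything beyond `context_len` words.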

2-gram LM Model

This approach clearly has several drawbacks:

Parameter explosion
There is a limit on the longest dependencies that can be captured.

Idea 1: Local Dependency¶

\[ P(x_1,\cdots,x_T) = \prod_{t = 1}^{T} P (x_t \mid x_1,\cdots,x_{t-1}) \approx \prod_{t=1}^{T} g(s_{t-2},x_{t-1})\]

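We assume that all information up to time \(t - 2\) can be encoded in a hidden variable \(s_{t-2}\), so we approximate the exact factorization with \(\prod_{t=1}^{T} g(s_{t-2},x_{t-1})\).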

[Local Dependency Assumption]: The sequential information of all previous time steps can be encoded into one hidden representation.

[Temporal Stationarity Assumption]: If a feature is useful at time \(t_1\), then it should also be useful at every other time step \(t_2\).

This can sometimes lead to over-smoothing: the model fails to describe highly distinctive events.

Recurrent Neural Network (RNN)¶

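An RNN cell is an MLP that takes an extra input \(h_{t-1}\) and produces an extra output \(h_t\), which is the encoded information described above.

Chain these cells together along the time axis.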

Recurrent Layer¶

Mathematically:

\[ y_t = V h_t \]

\[ h_t = f_W (h_{t-1}, x_t) \]
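The two equations above can be sketched in NumPy as follows. The choice of \(f_W\) as a tanh of an affine map, the parameter names `W_xh`, `W_hh`, `b_h`, and all sizes are assumptions made for illustration; the notes only fix the general form \(h_t = f_W(h_{t-1}, x_t)\), \(y_t = V h_t\).

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes chosen only for illustration.
input_dim, hidden_dim, output_dim, T = 3, 5, 2, 4

W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)
V = rng.normal(size=(output_dim, hidden_dim)) * 0.1

def rnn_forward(xs, h0):
    """Unroll one recurrent layer; the same weights are reused at every step."""
    h, hs, ys = h0, [], []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # h_t = f_W(h_{t-1}, x_t)
        hs.append(h)
        ys.append(V @ h)                           # y_t = V h_t
    return np.stack(hs), np.stack(ys)

xs = rng.normal(size=(T, input_dim))
hs, ys = rnn_forward(xs, np.zeros(hidden_dim))
```

Here `hs` has shape `(T, hidden_dim)` and `ys` has shape `(T, output_dim)`; the number of parameters does not depend on the sequence length `T`.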

Bidirectional RNN¶

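In the same spirit, run a second RNN from the end of the sequence backward, so that each position also receives information from later time steps.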
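A bidirectional layer can be sketched as two independent RNNs, one reading the sequence forward and one reading it reversed, with their hidden states concatenated per time step. The tanh cell, the concatenation (rather than, say, summation), and the helper names `run_rnn`, `make_params`, and `bidirectional_rnn` are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh):
    """Run a simple tanh RNN over xs; returns hidden states of shape (T, hidden_dim)."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)                # left-to-right pass
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]    # right-to-left pass, re-aligned in time
    return np.concatenate([h_fwd, h_bwd], axis=1)   # (T, 2 * hidden_dim)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 4, 3, 5   # hypothetical sizes

def make_params():
    return (rng.normal(size=(hidden_dim, input_dim)) * 0.1,
            rng.normal(size=(hidden_dim, hidden_dim)) * 0.1)

xs = rng.normal(size=(T, input_dim))
fwd_params, bwd_params = make_params(), make_params()
hs = bidirectional_rnn(xs, fwd_params, bwd_params)
```

At the last time step the backward half of `hs` has seen only the final input, while the forward half has seen the whole sequence; the reverse holds at the first time step.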

Deep RNN¶

Add layers: stack several hidden layers vertically, while the RNN cells are unrolled horizontally along the time axis.
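Stacking can be sketched by feeding the hidden-state sequence of one recurrent layer as the input sequence of the next. The tanh cell, the layer count, and the names `rnn_layer` and `deep_rnn` are illustrative assumptions.

```python
import numpy as np

def rnn_layer(xs, W_xh, W_hh):
    """Run one recurrent layer over a sequence; returns states of shape (T, hidden_dim)."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
    return np.stack(hs)

def deep_rnn(xs, layers):
    """Stack layers vertically: each layer consumes the states of the layer below."""
    for W_xh, W_hh in layers:
        xs = rnn_layer(xs, W_xh, W_hh)
    return xs

rng = np.random.default_rng(0)
T, input_dim, hidden_dim, num_layers = 6, 3, 5, 3   # hypothetical sizes
layers = []
for layer_idx in range(num_layers):
    in_dim = input_dim if layer_idx == 0 else hidden_dim
    layers.append((rng.normal(size=(hidden_dim, in_dim)) * 0.1,
                   rng.normal(size=(hidden_dim, hidden_dim)) * 0.1))

top_states = deep_rnn(rng.normal(size=(T, input_dim)), layers)
```

Only the first layer's input weights depend on `input_dim`; every higher layer maps hidden states to hidden states.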

RNN for LM¶

n-gram LM Model

Standard Architectures¶

many to one: Sentiment Classification
one to many: Image Captioning
many to many (heterogeneous): Machine Translation
many to many (homogeneous): Language Modeling

One to Many¶

Commonly used for tasks such as image captioning.

Challenges:

Long-term dependency: which word corresponds to which region
Heterogeneous input and output: from image to text generation

Sequence to Sequence¶

Many-to-One + One-to-Many

Encoder: many to one
Decoder: one to many

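Information bottleneck: the encoder has to compress the entire input sequence into a single fixed-size representation for the decoder.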

Beam Search¶

Back-Propagation Through Time¶

RNNs are particularly unstable due to the repeated multiplication by the same weight matrix.
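The effect of repeated multiplication can be illustrated numerically. The sketch below assumes a purely linear recurrence (the activation is ignored) and a weight matrix that is a scaled orthogonal matrix, so the norm of the back-propagated vector changes by exactly `scale` at each step; the function name and sizes are hypothetical.

```python
import numpy as np

def norm_after_repeated_multiplication(scale, steps, dim=4, seed=0):
    """Norm of a unit vector after `steps` multiplications by W^T, with W = scale * Q."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))   # Q is orthogonal
    W = scale * Q
    v = np.ones(dim) / np.sqrt(dim)                    # unit-norm starting vector
    for _ in range(steps):
        v = W.T @ v                                    # one step back through time
    return np.linalg.norm(v)

vanishing = norm_after_repeated_multiplication(0.9, 50)   # equals 0.9 ** 50
exploding = norm_after_repeated_multiplication(1.1, 50)   # equals 1.1 ** 50
```

With a scale slightly below 1 the norm shrinks geometrically, and with a scale slightly above 1 it grows geometrically, which is the vanishing and exploding behavior discussed next.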

Exploding and Vanishing Gradient¶

Exploding Gradient: Gradient Clipping¶

\[g := \frac{\partial \epsilon}{\partial \theta}\]

If \(\Vert g \Vert\) exceeds a given threshold, rescale \(g\) so that its norm equals that threshold.
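Norm-based clipping as described here can be sketched in a few lines of NumPy. The function name `clip_gradient` and the use of the Euclidean norm over a single flat gradient vector are assumptions for illustration.

```python
import numpy as np

def clip_gradient(g, threshold):
    """Rescale g to have norm `threshold` if its norm exceeds it; otherwise return g unchanged."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return g * (threshold / norm)   # direction is kept, only the length changes
    return g

clipped = clip_gradient(np.array([3.0, 4.0]), threshold=1.0)     # norm 5 is rescaled to norm 1
untouched = clip_gradient(np.array([0.3, 0.4]), threshold=1.0)   # norm 0.5 is left as-is
```

Clipping changes only the step size, not the update direction, so it bounds the size of a single update without otherwise altering the optimization.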
