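Learn PyTorch with Me, Part 08: Recurrent Neural Networks (RNN)

Earlier parts covered CNNs, the various CNN-based networks, and their applications in image processing. These networks share one trait: both the input and the output have a fixed length. They are very effective on datasets such as MNIST, CIFAR-10, and ImageNet, but they can only handle data whose inputs and outputs are of fixed length.

In practice, many tasks involve variable-length data. In machine translation, for example, sentences in the source language differ in length, and so do the corresponding sentences in the target language. A CNN cannot deliver the desired result here. Structurally, fully connected networks and convolutional networks both follow an input layer, (multiple) hidden layers, output layer layout. Adjacent layers are fully or partially connected, but there are no connections within a layer, so every connection points in the same direction.

Making computers imitate human behavior has long been a research goal. Images can be recognized with a CNN, but what should be used for sequence data? This chapter covers the processing of sequence data and introduces another member of the neural network family: the recurrent neural network (Recurrent Neural Network, RNN). RNNs were designed to handle variable-length data.

This chapter first discusses how sequence data is processed, then introduces the standard RNN and the problems it faces. It then covers an extension of the RNN, LSTM (Long Short-Term Memory), and the application of RNNs (Recurrent Neural Networks, used here as a collective term for RNN variants) to NLP (Natural Language Processing). Finally, it walks through an example of implementing RNNs in PyTorch.

1. Sequence Data Processing

Sequence data includes time series and strings; common examples are time-series data, text, and speech. Models that process sequence data are called sequence models. Sequence models are central to natural language processing and depend on temporal information. Among traditional machine learning methods, the sequence models include the hidden Markov model (Hidden Markov Model, HMM) and the conditional random field (Conditional Random Field, CRF), both of which are probabilistic graphical models. HMMs are widely used in speech recognition and character recognition, while CRFs are widely used for word segmentation, part-of-speech tagging, and named entity recognition.

Neural networks that process sequence data help us predict future patterns from known data and have achieved good results in pattern recognition. Such a network seems able to peek into the future; for example, it can predict a stock's trend from its prices over the past few days. A feedforward network always produces the same output for a given input, so the input information must be encoded properly. There are many ways to encode time-series data; the simplest and most widely used is the sliding-window method, described next.

On a time series, a sliding window splits the sequence into two windows that represent the past and the future. The size of both windows must be chosen by hand. When predicting stock prices, for instance, the size of the past window determines how far back the data used for prediction reaches. To predict the prices of the next 2 days from the past 5 days, the network needs 5 inputs and 2 outputs. Consider the following simple time series: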

1,2,3,4,3,2,1,2,3,4,3,2,1

A network can use 3 input neurons and 1 output neuron, that is, predict the next time step from the previous 3 time steps. The training set then begins as follows:

[1,2,3] --> [4]
[2,3,4] --> [3]
[3,4,3] --> [2]
[4,3,2] --> [1]

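In other words, starting from the beginning of the sequence, the input window has size 3 and the 4th element falls in the output window as the expected output. The window then slides forward with a stride of 1; whatever falls in the input window is the input, and whatever falls in the output window is the output. Sliding the window forward yields a series of training samples. The sizes of both windows can vary. For example, to predict the next 2 time steps from the past 3, the output window has size 2 and the training data becomes: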

[1,2,3] --> [4,3]
[2,3,4] --> [3,2]
[3,4,3] --> [2,1]
[4,3,2] --> [1,2]
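
The sliding-window encoding above can be sketched in a few lines of plain Python. The helper name `sliding_windows` and its parameters are illustrative assumptions, not part of any library.

```python
def sliding_windows(series, n_in, n_out, stride=1):
    """Split a series into (input window, output window) training pairs."""
    pairs = []
    last_start = len(series) - n_in - n_out
    for start in range(0, last_start + 1, stride):
        inputs = series[start:start + n_in]
        outputs = series[start + n_in:start + n_in + n_out]
        pairs.append((inputs, outputs))
    return pairs


series = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1]
one_step = sliding_windows(series, n_in=3, n_out=1)
two_step = sliding_windows(series, n_in=3, n_out=2)
for inputs, outputs in two_step[:4]:
    print(inputs, "-->", outputs)
```

The first four pairs of `one_step` and `two_step` correspond to the two listings above; the window keeps sliding until the output window reaches the end of the series.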

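The two examples above encode a single time series, but several series can be encoded together. Suppose we want to predict a stock's trend from its past prices and trading volumes. We then have two time series, one of prices and one of volumes:

Series 1: 1,2,3,4,3,2,1,2,3,4,3,2,1
Series 2: 10,20,30,40,30,20,10,20,30,40,30,20,10

The data of series 2 now has to be merged into series 1. Again with an input window of size 3 and an output window of size 1, the training set is: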

[1,10,2,20,3,30] --> [4]
[2,20,3,30,4,40] --> [3]
[3,30,4,40,3,30] --> [2]
[4,40,3,30,2,20] --> [1]
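
A minimal sketch of the two-series encoding, assuming both series are aligned in time. The function name `interleaved_windows` is hypothetical; it interleaves the values of the two series inside each input window and takes the output only from the series being predicted.

```python
def interleaved_windows(target, auxiliary, n_in, n_out):
    """Build training pairs whose inputs interleave two aligned series."""
    assert len(target) == len(auxiliary), "series must be aligned"
    pairs = []
    for start in range(len(target) - n_in - n_out + 1):
        inputs = []
        for t in range(start, start + n_in):
            inputs.extend([target[t], auxiliary[t]])
        outputs = target[start + n_in:start + n_in + n_out]
        pairs.append((inputs, outputs))
    return pairs


prices = [1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2, 1]
volumes = [10, 20, 30, 40, 30, 20, 10, 20, 30, 40, 30, 20, 10]
samples = interleaved_windows(prices, volumes, n_in=3, n_out=1)
print(samples[0])
```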

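Here series 1 is used to predict itself, and series 2 serves as auxiliary information. The same approach extends to prediction from several series, and the series being predicted does not have to appear in the input stream. For example, the stock prices of IBM and Apple could be used to predict the price of Microsoft, in which case Microsoft's price does not appear in the input.

The sliding-window mechanism resembles a convolution, so some people call it a 1-D convolution. In natural language processing, a sliding window is equivalent to an n-gram. In part-of-speech tagging, for example, the input window holds the context words and the output window gives the tag of the rightmost word in the input window; the window slides forward one position at a time until the sentence ends. Text can be vectorized with one-hot encoding or with word embeddings. Embeddings are the denser representation, which reduces the number of network parameters and speeds up training.

Although the sliding-window mechanism can encode sequence data, it reduces the sequence problem to a one-to-one mapping from an input string to an output string, both of fixed size. Many tasks need something richer than a one-to-one mapping. In sentiment analysis, we feed in a whole sentence to judge its polarity, and sentence length varies from example to example. Or the input may be more complex, such as an image from which a descriptive sentence is generated. Such tasks have no fixed input-to-output mapping. Instead, the network needs to remember the input: while reading it, the network must retain the key information. This calls for a network that can hold memory, that is, a stateful network, which is introduced next.

2. Recurrent Neural Networks

Recurrent neural networks (Recurrent Neural Network, RNN) have been developed gradually since the 1980s. Unlike a CNN, an RNN contains a loop, which is where the name comes from. Note that the abbreviation RNN is sometimes also used for the recursive neural network (Recursive Neural Network), but these are two different architectures: a recursive network has a deep tree structure, whereas a recurrent network has a chain structure. In most cases RNN refers to the recurrent neural network. RNNs is the collective term for variants built on it.

RNNs can handle variable-length sequences and "predict the future" by analyzing time-series data, such as the next word to be spoken, the trajectory of a car, or the next note played on a piano. Because an RNN can work on sequences of arbitrary length, it is used widely in NLP: machine translation, speech recognition, sentiment analysis, dialogue systems, and so on.

A recurrent network takes the form of a chain of repeated copies of a basic module. In a standard RNN this module contains just one simple layer, such as a Tanh layer (whose output lies between -1 and 1), as shown in the figure below.

[Figure: the repeating module of a standard RNN, a single Tanh layer]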

The forward pass of a standard RNN is:
$s_t = \tanh(W [s_{t-1}, x_t] + b)$
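
To make the recurrence concrete, here is a minimal sketch of one forward step in plain Python, with the previous state and the current input concatenated before the matrix product. The sizes and weight values are made up for illustration only.

```python
import math


def rnn_step(W, b, s_prev, x_t):
    """One step of s_t = tanh(W [s_{t-1}, x_t] + b) using plain lists."""
    concat = s_prev + x_t  # list concatenation: [s_{t-1}, x_t]
    return [
        math.tanh(sum(w * v for w, v in zip(row, concat)) + bias)
        for row, bias in zip(W, b)
    ]


# Hypothetical sizes: 2 hidden units, 1 input feature, so W is 2 x 3.
W = [[0.5, -0.3, 0.8],
     [0.1, 0.4, -0.6]]
b = [0.0, 0.1]

state = [0.0, 0.0]  # initial state set to zero
for x in [[1.0], [0.5], [-1.0]]:
    state = rnn_step(W, b, state, x)  # the same W and b are reused at every step
    print(state)
```

The loop can run over a sequence of any length, which is what lets the same small set of parameters process variable-length input.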

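The loop in an RNN means that a neuron's input includes the output of neurons in the same layer, as shown in the figure below. The left side is the RNN's structure and the right side is the same structure unrolled over time steps. The time step is a very important concept in RNNs: memory is stored in and flows through the hidden units across time steps, and the hidden units produce one output per time step. All time steps of an RNN share the parameters $W$, $U$, and $V$.

[Figure: an RNN (left) and the same network unrolled over time steps (right)]

Memory is stored in and flows through the hidden units, and each output is computed from the hidden units. The unrolled view shows how the network emits an output sequence, $o_{t-1}$, $o_{t}$, $o_{t+1}$ in the figure, with each unrolled layer responsible for one output. $x_t$ is the input at time $t$, for example a one-hot vector for the current word. $s_{t}$ is the hidden state at time $t$ and serves as the network's memory; it is computed from the previous hidden state and the current input, with Tanh or ReLU as the activation function. The state at the initial time $t_0$ can simply be initialized to 0. $o_t$ is the network's output at time $t$. In text processing, for example, the output may be a probability vector over the vocabulary, obtained with Softmax.

By the shape of their inputs and outputs, RNNs fall into the following categories:

1-N: one input, many outputs. Examples: image captioning, where the input is an image and the output is a sentence; music generation, where the input is a single value representing a note or a style and the network generates a melody; and sentence generation.
N-1: many inputs, one output. These are mostly prediction and classification tasks over the input sequence, such as semantic classification, sentiment analysis, weather forecasting, stock market prediction, product recommendation, DNA sequence classification, and anomaly detection.
N-N: many inputs and many outputs of equal length, as in named entity recognition and part-of-speech tagging.
N-M: in general N≠M, so input and output lengths differ, as in machine translation and text summarization.

Different tasks call for different recurrent network structures, as shown in the figure below.

[Figure: RNN structures for the 1-N, N-1, N-N, and N-M cases]

RNNs can also be distinguished by the direction of propagation, which gives the bidirectional RNN. All the structures above produce an output from the current and past time steps, but some tasks, such as speech recognition, need later information to decide an earlier output. The bidirectional RNN was proposed for this purpose. It adds connections from time $t$ back to time $t-1$, so the network can adjust the current state according to future states. Speech input is a good practical example: the system first emits a sequence it considers plausible, then revises what it has already emitted once more input arrives. The structure of a bidirectional RNN is shown below.

[Figure: structure of a bidirectional RNN]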

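An RNN is trained by unrolling it over time steps and running backpropagation, whose goal is to compute the gradient of the loss with respect to all network parameters. Because the parameters are shared across all time steps, the gradient at each step depends not only on the computation at that step but also on earlier steps. Unrolling the network over time and backpropagating through it is called Back Propagation Through Time (BPTT), an extension of ordinary backpropagation. As in a conventional network, the unrolled network is run forward to compute the outputs $y_0,y_1,...,y_{t-1},y_t$, the loss is computed from those outputs, and the parameters are updated with BPTT. Gradients flow back from every output used in the loss function, not only from the output at the last time step. If the loss uses $y_2,y_3,y_4$, for instance, the gradients flow back from these three outputs and not from $y_0,y_1$. Because $W$ and $b$ are shared across all time steps, backpropagation can be computed correctly over every step.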
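
The following sketch illustrates the $y_2,y_3,y_4$ case with PyTorch autograd, which performs BPTT on the unrolled graph. The layer sizes, the squared-error loss, and the random data are assumptions made for the illustration. Inputs at steps 0 to 4 still receive gradient through the recurrence, while the input at step 5 receives none, because it cannot influence any output used in the loss.

```python
import torch
from torch import nn

torch.manual_seed(0)

rnn = nn.RNN(input_size=1, hidden_size=4)   # tanh RNN, weights shared over time
readout = nn.Linear(4, 1)

x = torch.randn(6, 1, 1, requires_grad=True)  # (seq_len, batch, features)
target = torch.randn(6, 1, 1)

hidden_states, _ = rnn(x)      # unrolled forward pass over all 6 steps
y = readout(hidden_states)     # y_0 ... y_5

# The loss uses only y_2, y_3, y_4.
loss = ((y[2:5] - target[2:5]) ** 2).mean()
loss.backward()                # autograd performs BPTT through the unrolled graph

print(rnn.weight_hh_l0.grad.shape)   # one gradient for the shared recurrent weights
print(x.grad.abs().sum(dim=(1, 2)))  # per-step gradient magnitude on the inputs
```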

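Because of vanishing gradients (the common case) and exploding gradients (rare, but very damaging to optimization), it is hard to train an RNN to capture long-term dependencies. Sometimes predicting a word only requires nearby words, not the distant start of the sentence. Take a language model that predicts the next word from the sequence so far: to predict the last word "sky" in "the clouds are in the sky", no further context is needed, because "the clouds are in the" is enough. When the target word is this close to the relevant information, an RNN can learn the dependency. Other predictions need context from much farther away. In "I grew up in France... I speak fluent French.", predicting the final word "French" from the nearby words "speak fluent" only tells us that the answer is a language. Deciding which language requires the earlier context containing "France", so the context that a target word depends on may be far away.

Unfortunately, as this gap grows, vanishing or exploding gradients make it hard for RNNs to learn these connections: a vanishing gradient leaves the optimizer without a usable direction, and an exploding gradient makes learning unstable. A recurrent network repeats the same operation at every step of a long sequence, which amounts to a very deep computation graph with shared parameters. Errors accumulate as they propagate through the layers, which makes the long-term dependency problem more pronounced and eventually leaves the network unable to learn from earlier information.

That covers the concept and categories of the standard RNN. There are many more effective extensions of the RNN, and the networks in wide use are built on it. The next section looks at some of them.

3. LSTM and GRU

To address the long-term dependency problem, the RNN was refined into LSTM (Long Short-Term Memory). Read literally, it is a short-term memory, just a relatively long one: we need contextual dependencies, but we do not want them to reach back indefinitely, hence the name long short-term memory.

LSTM solves the long-term dependency problem with gating. Where a standard RNN has a single layer, an LSTM unit has four: three Sigmoid gate layers (the forget gate, input gate, and output gate) and one Tanh layer. The unit takes the input at the current time step, passes it through these gates, and produces one output for that time step. The gates use the Sigmoid activation, whose output lies between 0 and 1 and determines whether, and how far, each gate is open. This gives the LSTM the ability to remove or add information.

The figure below shows the structure of an LSTM. It has 3 Sigmoid layers, which from left to right are the forget gate (Forget Gate), input gate (Input Gate), and output gate (Output Gate). All three take the current input $x_t$ and the previous output $h_{t-1}$ as input and play different roles in the forward pass. The gates and their formulas are described in turn below.

[Figure: structure of an LSTM cell]

(1) Forget gate: also called the keep gate (Keep Gate), which describes the same thing from the opposite side. The forget gate controls which information in the memory cell is discarded (forgotten) and which is kept; this behavior is learned from data. The gate's Sigmoid layer outputs a value between 0 and 1 that is applied to the memory cell at time $t-1$: 0 means the past memory is forgotten completely, and 1 means it is kept completely. The gate's position in the overall structure and its forward formula are shown below:

[Figure: the forget gate and its forward formula]

(2) Input gate: also called the update gate (Update Gate) or write gate (Write Gate). In short, the input gate decides what is written to the memory cell. It has two parts, a Sigmoid layer and a Tanh layer. Like the Sigmoid layer, the Tanh layer takes the current input $x_t$ and the previous output $h_{t-1}$. The Tanh layer proposes candidate values to write into the new state, based on the new input and the network's existing memory, and the Sigmoid layer decides how much of each candidate is actually written. Values are written to the memory cell only when the input gate is open, and this behavior is also learned by the network. The gate's position in the overall structure and its forward formula are shown below:

[Figure: the input gate and its forward formula]

With the forget gate and input gate in place, the next step is to update the cell state, that is, the value of the network's memory cell. The previous two steps prepared the values; what remains is the update itself. The current cell state $C_t$ is the sum of two parts. The first is what remains after the forget gate: the product of the previous cell state $C_{t-1}$ and $f_t$. The second is the new information taken from the input: the product of $i_t$ and $\tilde{C}_t$, which is what actually gets written to the cell state. Here $\tilde{C}_t$ is the candidate value computed from the new input $x_t$ at time $t$ and the previous hidden output $h_{t-1}$, as shown in the figure below.

[Figure: the cell state update and its formula]

(3) Output gate: the output gate reads the freshly updated state, the memory cell, and produces the output. Which information is emitted is controlled by the output gate $o_t$, implemented with a Sigmoid layer that produces values in (0, 1). The cell state $C_t$ passes through a Tanh, which yields a candidate output in (-1, 1), and this is multiplied by $o_t$ to give the actual output $h_t$. The gate's position in the overall structure and its forward formula are shown below.

[Figure: the output gate and its forward formula]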
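
For reference, the gate computations described above can be written out in the standard LSTM formulation, using the symbols of this section. The weight matrices $W_f, W_i, W_C, W_o$ and biases $b_f, b_i, b_C, b_o$ are notation assumed here; $\sigma$ is the Sigmoid function and $*$ is element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\
i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\
\tilde{C}_t &= \tanh(W_C [h_{t-1}, x_t] + b_C) \\
C_t &= f_t * C_{t-1} + i_t * \tilde{C}_t \\
o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\
h_t &= o_t * \tanh(C_t)
\end{aligned}
```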

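Because LSTM handles the long-term dependency problem of the standard RNN effectively, it is very widely used. When people say RNNs today, they mostly mean LSTM or one of its variants.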
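
As a rough illustration of how the three gates interact, the sketch below runs a scalar LSTM step in plain Python. It assumes a single hidden unit and a single input feature, and the weights are invented for the example; a real LSTM uses vectors and learned matrices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(params, h_prev, c_prev, x_t):
    """One scalar LSTM step; params maps gate name -> (w_h, w_x, bias)."""
    def pre(name):
        w_h, w_x, bias = params[name]
        return w_h * h_prev + w_x * x_t + bias

    f_t = sigmoid(pre("forget"))           # how much of C_{t-1} to keep
    i_t = sigmoid(pre("input"))            # how much of the candidate to write
    c_tilde = math.tanh(pre("candidate"))  # candidate value in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde     # updated cell state
    o_t = sigmoid(pre("output"))           # how much of the cell state to expose
    h_t = o_t * math.tanh(c_t)             # output / hidden state
    return h_t, c_t


# Hypothetical weights chosen only for illustration.
params = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.3, -0.2, 0.0),
    "candidate": (0.8, 1.0, 0.0),
    "output": (0.1, 0.4, 0.0),
}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(params, h, c, x)
    print(round(h, 4), round(c, 4))
```

Pushing the forget-gate bias strongly negative makes the cell drop its previous memory, and pushing it strongly positive while closing the input gate makes the cell carry its memory forward almost unchanged, which is the mechanism that helps with long-term dependencies.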

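As shown above, LSTM has an intricate structure and forward pass. In practice, PyTorch wraps it, so a program only needs to supply the required parameters. The PyTorch definition of LSTM is: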

torch.nn.LSTM(*args, **kwargs)

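It accepts the following parameters:

input_size: the number of features in the input

hidden_size: the number of features in the hidden state h

num_layers: the number of stacked recurrent layers

bias: defaults to True; if set to False, the bias terms b_ih and b_hh are not used.

batch_first: if set to True, input and output Tensors are ordered as (batch, seq, feature).

dropout: if nonzero, adds a Dropout layer on the output of every layer except the last.

bidirectional: if set to True, the LSTM becomes bidirectional; defaults to False.

The following snippet is a simple LSTM example. It defines an LSTM with input size 10, hidden size 20, and 2 recurrent layers (stacked layers, not the layers produced by unrolling over time). The input is input, the initial hidden state is h0, and the initial cell state is c0. The call returns the output features of the last layer as a Tensor, together with the final hidden and cell states: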

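import torch
from torch import nn

rnn = nn.LSTM(10,20,2)  # input_size=10, hidden_size=20, num_layers=2
input = torch.randn(5,3,10)  # (seq_len, batch, input_size)
h0 = torch.randn(2,3,20)  # (num_layers, batch, hidden_size)
c0 = torch.randn(2,3,20)  # same shape as h0
output,(hn,cn) = rnn(input,(h0,c0))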
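
The `batch_first` and `bidirectional` parameters change the tensor shapes, which is easy to get wrong. The sketch below, with sizes chosen arbitrarily, prints the shapes for a 2-layer bidirectional LSTM that takes batch-first input.

```python
import torch
from torch import nn

# batch_first=True expects (batch, seq, feature); bidirectional doubles the output features.
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2,
               batch_first=True, bidirectional=True)

x = torch.randn(3, 5, 10)      # batch=3, seq_len=5, input_size=10
output, (h_n, c_n) = lstm(x)   # initial states default to zeros when omitted

print(output.shape)  # (batch, seq_len, 2 * hidden_size)
print(h_n.shape)     # (num_layers * 2, batch, hidden_size)
print(c_n.shape)     # same shape as h_n
```

Note that `batch_first` only affects the input and output tensors; the hidden and cell states keep the layer dimension first.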

PyTorch also provides LSTMCell, defined as follows; its parameters have the same meaning as those of LSTM:

class torch.nn.LSTMCell(input_size,hidden_size,bias=True)

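LSTMCell computes a single time step of an LSTM layer, so stepping it over a sequence reproduces what LSTM does for one layer, as the example shows:

import torch
from torch import nn

rnn = nn.LSTMCell(10,20)
input = torch.randn(6,3,10)  # (seq_len, batch, input_size)
hx = torch.randn(3,20)  # (batch, hidden_size)
cx = torch.randn(3,20)
output = []
for i in range(6):
    hx,cx = rnn(input[i],(hx,cx))
    output.append(hx)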
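
One way to check the relationship between the two classes is to copy the weights of a single-layer `nn.LSTM` into an `nn.LSTMCell` and compare the outputs. This is a sketch for illustration; the sizes are arbitrary and the tolerance is an assumption.

```python
import torch
from torch import nn

torch.manual_seed(0)
lstm = nn.LSTM(10, 20)      # a single-layer LSTM
cell = nn.LSTMCell(10, 20)

# Copy the layer-0 weights of the LSTM into the cell.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

x = torch.randn(6, 3, 10)   # (seq_len, batch, input_size)
h0 = torch.zeros(3, 20)
c0 = torch.zeros(3, 20)

full_output, _ = lstm(x, (h0.unsqueeze(0), c0.unsqueeze(0)))

hx, cx = h0, c0
steps = []
for t in range(x.size(0)):
    hx, cx = cell(x[t], (hx, cx))   # one time step
    steps.append(hx)
stepped_output = torch.stack(steps)

print(torch.allclose(full_output, stepped_output, atol=1e-6))
```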

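LSTM has many variants. A well-known one is the GRU (Gated Recurrent Unit). While largely preserving the effectiveness of LSTM, it merges the forget gate and input gate into a single update gate, merges the cell state and hidden state, and makes some other changes. Because a GRU has one gate layer fewer than a standard LSTM, it trains faster and is more convenient for building larger networks. The GRU's structure and forward formulas are shown in the figure below:

[Figure: structure of a GRU and its forward formulas]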
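
For reference, one common way to write the GRU forward pass is sketched below, with $z_t$ the update gate, $r_t$ the reset gate, and biases omitted for brevity. The weight names $W_z, W_r, W$ are notation assumed here, and conventions differ between references, for example in which of the two terms of the last line $z_t$ multiplies.

```latex
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t]) \\
r_t &= \sigma(W_r [h_{t-1}, x_t]) \\
\tilde{h}_t &= \tanh(W [r_t * h_{t-1}, x_t]) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
```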

The PyTorch definition of GRU:

class torch.nn.GRU(*args, **kwargs)

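A simple GRU example:

import torch
from torch import nn

rnn = nn.GRU(10,20,2)  # input_size=10, hidden_size=20, num_layers=2
input = torch.randn(5,3,10)  # (seq_len, batch, input_size)
h0 = torch.randn(2,3,20)  # (num_layers, batch, hidden_size)
output,hn = rnn(input,h0)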

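4. LSTM in Natural Language Processing

The previous section covered where LSTM comes from and what each part does. Because it handles sequence data well and mitigates the long-dependency problem in training, LSTM is widely used in NLP. Some common applications follow.

1. Part-of-speech tagging

Part-of-speech tagging (Part-of-Speech Tagging, POS Tagging) is one of the most basic tasks in natural language processing: it labels each word of a given sentence with its part of speech and underpins other NLP tasks. This section shows how to do POS tagging with an LSTM in PyTorch.

Write the input sentence as $w_1,w_2,...,w_M$ with $w_i \in V$, where $V$ is the vocabulary. Let $T$ be the set of all part-of-speech tags and $y_i$ the tag of $w_i$. We want to predict the tag of $w_i$, written $\hat{y}_i$. The model's output is $\hat{y}_1,\hat{y}_2,...,\hat{y}_M$ with $\hat{y}_i \in T$. The sentence is fed to the LSTM, and the hidden state at step $i$ is written $h_i$. Each tag has a unique index, and the forward formula for predicting $\hat{y}_i$ is:

$\hat{y}_i = \arg\max_j (\log \mathrm{Softmax}(A h_i + b))_j$

An affine map is applied to the hidden state, followed by log Softmax, and the predicted tag is the index of the largest value in the resulting vector. The target space of $A$ has size $|T|$.

Preparing the data: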

import torch

# Convert a sequence of tokens into a tensor of indices
def prepare_sequence(seq,to_idx):
    idxs = [to_idx[w] for w in seq]
    return torch.tensor(idxs,dtype=torch.long)

# Training data: each sentence as a list of words, paired with its tags
training_data = [("The dog ate the apple".split(),["DET","NN","V","DET","NN"]),
                 ("Everybody read that book".split(),["NN","V","DET","NN"])]

word_to_idx = {}
for sent,tags in training_data:
    for word in sent:
        if word not in word_to_idx:
            word_to_idx[word] = len(word_to_idx)
print(word_to_idx)

# Tag indices
tag_to_idx = {"DET":0,"NN":1,"V":2}

# 32 or 64 dimensions are typical; small values are used here so the weights are easy to inspect during training
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

Defining the model:

import torch
from torch import nn
import torch.nn.functional as F


class LSTMTagger(nn.Module):

    def __init__(self,embedding_dim,hidden_dim,vocab_size,tagset_size):
        super(LSTMTagger,self).__init__()
        self.hidden_dim = hidden_dim
        # Word embeddings, given the vocabulary size and the embedding dimension
        self.word_embeddings = nn.Embedding(vocab_size,embedding_dim)
        # The LSTM takes word embeddings as input and outputs hidden states of size hidden_dim
        self.lstm = nn.LSTM(embedding_dim,hidden_dim)
        # Linear layer that maps the hidden state space to the tag space
        self.hidden2tag = nn.Linear(hidden_dim,tagset_size)
        self.hidden = self.init_hidden()

    # Initialize the state (h0, c0), each of shape (num_layers, batch, hidden_dim)
    def init_hidden(self):
        return (torch.zeros(1,1,self.hidden_dim),
                torch.zeros(1,1,self.hidden_dim))

    # Forward pass; reset self.hidden with init_hidden() before each new sentence
    def forward(self,sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out,self.hidden = self.lstm(embeds.view(len(sentence),1,-1),self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence),-1))
        tag_scores = F.log_softmax(tag_space,dim=1)
        return tag_scores

For the full training procedure, see the tutorial on the official PyTorch site. Note that this POS tagging task uses the negative log-likelihood loss and the SGD optimizer with a learning rate of 0.1:

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(),lr=0.1)
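
A compact, self-contained sketch of how such a training loop might look is given below. It is not the code of the official tutorial: the class name `TinyTagger`, the helper `to_tensor`, and the epoch count are assumptions, and the model lets the LSTM start from a zero state for every sentence instead of storing the hidden state on the module.

```python
import torch
from torch import nn, optim
import torch.nn.functional as F

torch.manual_seed(1)

training_data = [("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
                 ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])]
word_to_idx = {}
for sent, _ in training_data:
    for word in sent:
        word_to_idx.setdefault(word, len(word_to_idx))
tag_to_idx = {"DET": 0, "NN": 1, "V": 2}


def to_tensor(seq, to_idx):
    return torch.tensor([to_idx[s] for s in seq], dtype=torch.long)


class TinyTagger(nn.Module):
    """Embedding -> LSTM -> linear layer -> log softmax over tags."""

    def __init__(self, vocab_size, tagset_size, embedding_dim=6, hidden_dim=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.embed(sentence).view(len(sentence), 1, -1)
        lstm_out, _ = self.lstm(embeds)  # zero initial state for every sentence
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        return F.log_softmax(tag_space, dim=1)


model = TinyTagger(len(word_to_idx), len(tag_to_idx))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(200):  # the epoch count is an arbitrary choice
    total = 0.0
    for sentence, tags in training_data:
        model.zero_grad()
        scores = model(to_tensor(sentence, word_to_idx))
        loss = loss_function(scores, to_tensor(tags, tag_to_idx))
        loss.backward()
        optimizer.step()
        total += loss.item()
    epoch_losses.append(total / len(training_data))

print(epoch_losses[0], epoch_losses[-1])
```

`NLLLoss` expects log-probabilities, which is why the model ends with `log_softmax` rather than `softmax`.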

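2. Sentiment analysis

This subsection covers another NLP application of LSTM: sentiment analysis. In a paper, Bjarke Felbo describes DeepMoji, a sentiment analysis model trained on 1.2 billion tweets containing emoji in order to learn how language expresses emotion. Through this training, the model achieves state-of-the-art performance on many emotion-related text modeling tasks. TorchMoji is a PyTorch implementation of the model proposed in the paper. The model consists of two bidirectional LSTM layers followed by an Attention layer and a classifier; its structure is shown in the figure below.

[Figure: structure of the DeepMoji model]

DeepMoji analyzes the sentiment of an input sentence and produces matching emoji, as shown in the figure below. For example, the inputs "What is happening to me ??" and "What a good day !" yield different emoji, each with a confidence score. See GitHub for the code.

[Figure: emoji predicted by DeepMoji for sample sentences]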

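5. Sequence-to-Sequence Networks

1. How sequence-to-sequence works

A sequence-to-sequence network (Seq2seq Network), also called an encoder-decoder network (Encoder Decoder Network), consists of two separate recurrent networks, the encoder (Encoder) and the decoder (Decoder), usually implemented with LSTM or GRU. The encoder processes the input; its goal is to understand the input and represent it in its final state. The decoder starts from the encoder's final state and generates the target sequence word by word, taking its own output from the previous time step as input at each step. The overall process is shown in the figure below.

[Figure: the encoder-decoder process of a sequence-to-sequence network]

The most common use of sequence-to-sequence models is machine translation. The input sentence is tokenized and converted to word vectors, one word is fed to the encoder per time step, and EOS (End of Sentence) marks the end of the sentence. Once the whole sentence has been fed in, the encoder's hidden state is used to initialize the decoder. The first word fed to the decoder is SOS (Start of Sentence), which marks the start of the target sentence, and the resulting output is the first word in the target language. That output then becomes the decoder's input at the next time step. The process repeats until the decoder emits an EOS, which marks the end of the target sentence and completes the translation from the source language to the target language. A concrete example follows later.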
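
The decoding loop just described can be sketched without any neural network. In the example below, `toy_step` is a hypothetical stand-in for a trained decoder and the "encoder state" is simply a list of words; only the control flow (start from SOS, feed each output back in, stop at EOS) reflects the real procedure.

```python
SOS, EOS = "<SOS>", "<EOS>"


def greedy_decode(step, initial_state, max_length=10):
    """Feed each output back in as the next input until EOS or max_length."""
    token, state = SOS, initial_state
    output = []
    for _ in range(max_length):
        token, state = step(token, state)
        if token == EOS:
            break
        output.append(token)
    return output


def toy_step(token, state):
    """Stand-in for a trained decoder: emits the words stored in `state` one by one."""
    position, words = state
    if position >= len(words):
        return EOS, state
    return words[position], (position + 1, words)


# Pretend the encoder summarized the source sentence into this state.
encoder_final_state = (0, ["je", "suis", "content"])
print(greedy_decode(toy_step, encoder_final_state))
```

The `max_length` cap matters in practice because a poorly trained decoder may never emit EOS.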

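2. Attention mechanism

The way people translate offers many hints for improving the sequence-to-sequence model. A human translator first reads the whole sentence to understand its meaning and then starts writing the translation. An important detail is that while writing the new sentence, the translator keeps going back to the source text and pays special attention to the part currently being translated in order to settle on the best wording. In the sequence-to-sequence model described so far, the encoder reads the entire input and summarizes its meaning in its hidden state, which resembles the first part of human translation, and the decoder produces the final translation, which is the second part. The "special attention" step, however, is not yet represented in the model, and that is what remains to be added.

To add attention to the sequence-to-sequence model, the decoder is given access to all of the encoder's outputs when it produces the output at time $t$, so it can look at the source sentence, which was not possible before. Considering every encoder output at every time step still differs from human translation, though, because a translator attends to a specific, small part of the source for each part of the translation. Giving the decoder direct access to all encoder outputs is therefore unrealistic. The process has to be refined so that the decoder can dynamically attend to specific parts of the encoder outputs as it works. The solution researchers have proposed is to change the input to the concatenation operation: instead of using the encoder outputs directly, the outputs are weighted, with the decoder's state at time $t-1$ as the basis for the weights. Concretely, each encoder output is first given a score, computed as the dot product of the decoder's state at time $t-1$ and that encoder output. The scores are then normalized with a Softmax layer. Finally, before the concatenation, each encoder output is scaled by its normalized score. The key point is that the score computed for each encoder output expresses how important that output is to the decoder's decision at time $t$.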
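
The scoring steps above (dot product, Softmax, weighting) can be sketched in plain Python. The vectors are invented 2-dimensional values for a 3-word source sentence, and the function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_outputs):
    """Return attention weights and the weighted sum of encoder outputs."""
    scores = [sum(d * e for d, e in zip(decoder_state, out))
              for out in encoder_outputs]
    weights = softmax(scores)
    size = len(encoder_outputs[0])
    context = [sum(w * out[k] for w, out in zip(weights, encoder_outputs))
               for k in range(size)]
    return weights, context


# Hypothetical 2-dimensional states for a 3-word source sentence.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]   # decoder state at time t-1

weights, context = attend(decoder_state, encoder_outputs)
print(weights)
print(context)
```

Encoder outputs that align with the decoder state get larger weights, so the second source position, which is orthogonal to the decoder state here, contributes least to the context vector.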

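Since it was proposed, the attention mechanism has attracted wide interest and has worked well in applications such as speech recognition and image captioning.

6. PyTorch Example: Machine Translation with GRU and Attention

See GitHub for the complete code.

1. Common module (logger.py)

The common module here is mainly the logging module. During data processing and model training, the necessary log information should be kept so that the program's execution and results can be recorded and analyzed. Logs are written to a file and to the console at the same time.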

import logging as logger
logger.basicConfig(level=logger.DEBUG,
                   format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                   datefmt='%Y-%m-%d %H:%M:%S -',
                   filename='log.txt',
                   filemode='a')  # or 'w', default 'a'
console = logger.StreamHandler()
console.setLevel(logger.INFO)
formatter = logger.Formatter('%(asctime)s %(name)-6s: %(levelname)-6s %(message)s')
console.setFormatter(formatter)
logger.getLogger('').addHandler(console)

2. Data processing module (process.py)

The data processing module defines the data handling needed for training, including loading data from files, parsing it, and some helper functions.

from __future__ import unicode_literals, print_function, division
import math
import re
import time
import jieba
import torch
import unicodedata
from torch.autograd import Variable
from logger import logger

use_cuda = torch.cuda.is_available()
SOS_token = 0
EOS_token = 1
# Set this larger for Chinese
MAX_LENGTH = 25


def unicodeToAscii(s):
    '''
    Convert Unicode to ASCII, http://stackoverflow.com/a/518232/2809427
    :param s: string to convert
    :return: the string with combining marks removed
    '''
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def normalizeString(s):
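    '''
    Lowercase the string and remove invalid characters
    (the removal step is commented out below because of Chinese text)
    :param s: string to normalize
    :return: normalized string
    '''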
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    # The filter below must not be applied to Chinese text
    # s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)
    return s


class Lang:
    def __init__(self, name):
        '''
        need_cut selects the tokenization logic according to the language
        :param name: language name
        '''
        self.name = name
        self.need_cut = self.name == 'cmn'
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # start at 2 to count SOS & EOS

    def addSentence(self, sentence):
        '''
        Add a sentence from the corpus to the Lang
        :param sentence: one sentence from the corpus
        '''
        if self.need_cut:
            sentence = cut(sentence)
        for word in sentence.split(' '):
            if len(word) > 0:
                self.addWord(word)

    def addWord(self, word):
        '''
        Add a word to the Lang and count its frequency; a new word also grows the vocabulary
        :param word: word to add
        '''
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


def cut(sentence, use_jieba=False):
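    '''
    Tokenize a sentence.
    :param sentence: sentence to tokenize
    :param use_jieba: whether to segment words with jieba; by default the sentence is split into single characters
    :return: the tokens joined by spaces
    '''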
    if use_jieba:
        return ' '.join(jieba.cut(sentence))
    else:
        words = [word for word in sentence]
        return ' '.join(words)


import jieba.posseg as pseg


def tag(sentence):
    words = pseg.cut(sentence)
    result = ''
    for w in words:
        result = result + w.word + "/" + w.flag + " "
    return result


def readLangs(lang1, lang2, reverse=False):
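    '''
    Read the corpus file and build the language instances
    :param lang1: source language
    :param lang2: target language
    :param reverse: whether to translate in the reverse direction
    :return: source language instance, target language instance, sentence pairs
    '''
    logger.info("Reading lines...")

    # Read the txt file and split it into lines
    with open('data/%s-%s.txt' % (lang1, lang2), encoding='utf-8') as f:
        lines = f.read().strip().split('\n')

    # Split each line into a source-target pair and normalize both sides
    pairs = [[normalizeString(s) for s in l.split('\t')] for l in lines]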

    # Reverse pairs, make Lang instances
    if reverse:
        pairs = [list(reversed(p)) for p in pairs]
        input_lang = Lang(lang2)
        output_lang = Lang(lang1)
    else:
        input_lang = Lang(lang1)
        output_lang = Lang(lang2)

    return input_lang, output_lang, pairs


eng_prefixes = (
    "i am ", "i m ",
    "he is", "he s ",
    "she is", "she s",
    "you are", "you re ",
    "we are", "we re ",
    "they are", "they re "
)


def filterPair(p):
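    '''
    Keep a pair only if both sentences are shorter than MAX_LENGTH
    and the second sentence starts with one of eng_prefixes
    '''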
    return len(p[0].split(' ')) < MAX_LENGTH and \
           len(p[1].split(' ')) < MAX_LENGTH and \
           p[1].startswith(eng_prefixes)


def filterPairs(pairs):
    return [pair for pair in pairs if filterPair(pair)]


def prepareData(lang1, lang2, reverse=False):
    input_lang, output_lang, pairs = readLangs(lang1, lang2, reverse)
    logger.info("Read %s sentence pairs" % len(pairs))
    pairs = filterPairs(pairs)
    logger.info("Trimmed to %s sentence pairs" % len(pairs))
    logger.info("Counting words...")
    for pair in pairs:
        input_lang.addSentence(pair[0])
        output_lang.addSentence(pair[1])
    logger.info("Counted words:")
    logger.info('%s, %d' % (input_lang.name, input_lang.n_words))
    logger.info('%s, %d' % (output_lang.name, output_lang.n_words))
    return input_lang, output_lang, pairs


def indexesFromSentence(lang, sentence):
    '''
    :param lang: Lang instance that holds the vocabulary
    :param sentence: space-separated sentence
    :return: list of word indices
    '''
    return [lang.word2index[word] for word in sentence.split(' ') if len(word) > 0]


def variableFromSentence(lang, sentence):
    if lang.need_cut:
        sentence = cut(sentence)
    # logger.info("cuted sentence: %s" % sentence)
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    result = Variable(torch.LongTensor(indexes).view(-1, 1))
    if use_cuda:
        return result.cuda()
    else:
        return result


def variablesFromPair(input_lang, output_lang, pair):
    input_variable = variableFromSentence(input_lang, pair[0])
    target_variable = variableFromSentence(output_lang, pair[1])
    return (input_variable, target_variable)


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (- %s)' % (asMinutes(s), asMinutes(rs))


if __name__ == "__main__":
    s = 'Fans of Belgium cheer prior to the 2018 FIFA World Cup Group G match between Belgium and Tunisia in Moscow, Russia, June 23, 2018.'
    s = '結婚的和尚未結婚的和尚'
    s = "買張下周三去南海的飛機票，海航的"
    s = "過幾天天天天氣不好。"

    a = cut(s, use_jieba=True)
    print(a)
    print(tag(s))
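
To see what `Lang` and `variableFromSentence` produce without the file and library dependencies, the sketch below rebuilds the indexing logic in plain Python. The function names `build_vocab` and `sentence_to_indexes` and the sample sentences are assumptions made for this illustration.

```python
SOS_token, EOS_token = 0, 1


def build_vocab(sentences):
    """Assign an index to every new word, reserving 0 and 1 for SOS and EOS."""
    word2index = {}
    for sentence in sentences:
        for word in sentence.split(' '):
            if word and word not in word2index:
                word2index[word] = len(word2index) + 2
    return word2index


def sentence_to_indexes(word2index, sentence):
    """Map words to indices and append EOS, as variableFromSentence does."""
    indexes = [word2index[w] for w in sentence.split(' ') if w]
    indexes.append(EOS_token)
    return indexes


vocab = build_vocab(["i am cold .", "i am happy ."])
print(vocab)
print(sentence_to_indexes(vocab, "i am happy ."))
```

In the real module the resulting list is wrapped in a `LongTensor` of shape (length, 1) so that the encoder can consume one index per time step.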

3. Model definition (model.py)

This part defines the recurrent networks: an encoder RNN and a decoder RNN.

import torch
from torch import nn
from torch.autograd import Variable
from torch.nn import functional as F
from logger import logger
# from process import cut
from process import MAX_LENGTH

use_cuda = torch.cuda.is_available()


class EncoderRNN(nn.Module):
    '''
    Encoder definition
    '''

    def __init__(self, input_size, hidden_size, n_layers=1):
        '''
        Initialization
        :param input_size: input size, here the vocabulary size
        :param hidden_size: hidden layer size
        :param n_layers: number of stacked layers
        '''
        super(EncoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(input_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)

    def forward(self, input, hidden):
        '''
        Forward pass
        :param input: input
        :param hidden: hidden state
        :return: encoder output, hidden state
        '''
        try:
            embedded = self.embedding(input).view(1, 1, -1)
            output = embedded
            for i in range(self.n_layers):
                output, hidden = self.gru(output, hidden)
            return output, hidden
        except Exception as err:
            logger.error(err)

    def initHidden(self):
        '''
        Initialize the hidden state
        :return: the initialized hidden state
        '''
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result


class DecoderRNN(nn.Module):
    '''
    Decoder definition
    '''

    def __init__(self, hidden_size, output_size, n_layers=1):
        '''
        Initialization
        :param hidden_size: hidden layer size
        :param output_size: output size
        :param n_layers: number of stacked layers
        '''
        super(DecoderRNN, self).__init__()
        self.n_layers = n_layers
        self.hidden_size = hidden_size

        self.embedding = nn.Embedding(output_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size)
        self.out = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)  # normalize over the vocabulary dimension

    def forward(self, input, hidden):
        '''
        Forward pass
        :param input: input
        :param hidden: hidden state
        :return: decoder output, hidden state
        '''
        try:
            output = self.embedding(input).view(1, 1, -1)
            for i in range(self.n_layers):
                output = F.relu(output)
                output, hidden = self.gru(output, hidden)
            output = self.softmax(self.out(output[0]))
            return output, hidden
        except Exception as err:
            logger.error(err)

    def initHidden(self):
        '''
        Initialize the hidden state
        :return: the initialized hidden state
        '''
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result


class AttnDecoderRNN(nn.Module):
    '''
    Decoder with attention
    '''

    def __init__(self, hidden_size, output_size, n_layers=1, dropout_p=0.1, max_length=MAX_LENGTH):
        '''
        Initialization of the decoder with attention
        :param hidden_size: hidden layer size
        :param output_size: output size
        :param n_layers: number of stacked layers
        :param dropout_p: dropout rate
        :param max_length: maximum accepted sentence length
        '''
        super(AttnDecoderRNN, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.dropout_p = dropout_p
        self.max_length = max_length

        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.attn = nn.Linear(self.hidden_size * 2, self.max_length)
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.dropout_p)
        self.gru = nn.GRU(self.hidden_size, self.hidden_size)
        self.out = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, input, hidden, encoder_output, encoder_outputs):
        '''
        Forward pass
        :param input: input
        :param hidden: hidden state
        :param encoder_output: encoder output at a single time step
        :param encoder_outputs: all encoder outputs
        :return: decoder output, hidden state, attention weights
        '''
        try:
            embedded = self.embedding(input).view(1, 1, -1)
            embedded = self.dropout(embedded)

            attn_weights = F.softmax(
                self.attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)
            attn_applied = torch.bmm(attn_weights.unsqueeze(0),
                                     encoder_outputs.unsqueeze(0))

            output = torch.cat((embedded[0], attn_applied[0]), 1)
            output = self.attn_combine(output).unsqueeze(0)

            for i in range(self.n_layers):
                output = F.relu(output)
                output, hidden = self.gru(output, hidden)

            output = F.log_softmax(self.out(output[0]), dim=1)
            return output, hidden, attn_weights
        except Exception as err:
            logger.error(err)

    def initHidden(self):
        '''
        Initialize the hidden state
        :return: the initialized hidden state
        '''
        result = Variable(torch.zeros(1, 1, self.hidden_size))
        if use_cuda:
            return result.cuda()
        else:
            return result
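
The attention step inside `AttnDecoderRNN.forward` is easier to follow with the tensor shapes written out. The sketch below isolates that step with small, arbitrary sizes. Note that this decoder computes its scores with a learned linear layer over the concatenated input embedding and hidden state, rather than the dot product described in section 5, which is why the number of attention weights is fixed to `max_length`.

```python
import torch
import torch.nn.functional as F
from torch import nn

hidden_size, max_length = 8, 5   # small hypothetical sizes

attn = nn.Linear(hidden_size * 2, max_length)
embedded = torch.randn(1, 1, hidden_size)               # embedding of the current input word
hidden = torch.randn(1, 1, hidden_size)                 # previous decoder hidden state
encoder_outputs = torch.randn(max_length, hidden_size)  # one row per source position

# One weight per source position, computed from [embedded, hidden].
attn_weights = F.softmax(attn(torch.cat((embedded[0], hidden[0]), 1)), dim=1)

# (1, 1, max_length) x (1, max_length, hidden_size) -> (1, 1, hidden_size)
attn_applied = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))

print(attn_weights.shape, attn_weights.sum().item())
print(attn_applied.shape)
```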

4. Training module (train.py)

The training module defines the training procedure and the evaluation methods.

import sys
import random
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from torch import nn
from torch import optim
from torch.autograd import Variable
from process import *

use_cuda = torch.cuda.is_available()


def evaluate(input_lang, output_lang, encoder, decoder, sentence, max_length=MAX_LENGTH):
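    '''
    Evaluate a single sentence
    :param input_lang: source language information
    :param output_lang: target language information
    :param encoder: encoder
    :param decoder: decoder
    :param sentence: sentence to evaluate
    :param max_length: maximum accepted length
    :return: the translated words and the attention weights
    '''
    # Preprocess the input sentence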
    input_variable = variableFromSentence(input_lang, sentence)
    input_length = input_variable.size()[0]
    encoder_hidden = encoder.initHidden()

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs

    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(input_variable[ei],
                                                 encoder_hidden)
        encoder_outputs[ei] = encoder_outputs[ei] + encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[SOS_token]]))  # start-of-sentence marker SOS
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    decoder_hidden = encoder_hidden

    decoded_words = []
    decoder_attentions = torch.zeros(max_length, max_length)
    # Translation loop
    for di in range(max_length):
        decoder_output, decoder_hidden, decoder_attention = decoder(
            decoder_input, decoder_hidden, encoder_output, encoder_outputs)
        decoder_attentions[di] = decoder_attention.data
        topv, topi = decoder_output.data.topk(1)
        ni = topi[0][0].item()
        # Stop once the current output is the end-of-sentence marker
        if ni == EOS_token:
            decoded_words.append('<EOS>')
            break
        else:
            decoded_words.append(output_lang.index2word[ni])

        decoder_input = Variable(torch.LongTensor([[ni]]))
        decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    return decoded_words, decoder_attentions[:di + 1]


teacher_forcing_ratio = 0.5


def train(input_variable, target_variable, encoder, decoder, encoder_optimizer, decoder_optimizer, criterion,
          max_length=MAX_LENGTH):
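    '''
    A single training step
    :param input_variable: source language information
    :param target_variable: target language information
    :param encoder: encoder
    :param decoder: decoder
    :param encoder_optimizer: optimizer for the encoder
    :param decoder_optimizer: optimizer for the decoder
    :param criterion: the criterion, that is, the loss function
    :param max_length: maximum accepted sentence length
    :return: the average loss of this training step
    '''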
    encoder_hidden = encoder.initHidden()

    # Clear the gradients held by the optimizers
    encoder_optimizer.zero_grad()
    decoder_optimizer.zero_grad()

    input_length = input_variable.size()[0]
    target_length = target_variable.size()[0]
    # print(input_length, " -> ", target_length)

    encoder_outputs = Variable(torch.zeros(max_length, encoder.hidden_size))
    encoder_outputs = encoder_outputs.cuda() if use_cuda else encoder_outputs
    # print("encoder_outputs shape ", encoder_outputs.shape)
    loss = 0

    # Encoding
    for ei in range(input_length):
        encoder_output, encoder_hidden = encoder(
            input_variable[ei], encoder_hidden)
        encoder_outputs[ei] = encoder_output[0][0]

    decoder_input = Variable(torch.LongTensor([[SOS_token]]))
    decoder_input = decoder_input.cuda() if use_cuda else decoder_input

    decoder_hidden = encoder_hidden

    use_teacher_forcing = True if random.random() < teacher_forcing_ratio else False

    if use_teacher_forcing:
        # Teacher forcing: feed the target as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_output, encoder_outputs)
            loss += criterion(decoder_output, target_variable[di])
            decoder_input = target_variable[di]  # Teacher forcing

    else:
        # Without teacher forcing: feed the network's own prediction as the next input
        for di in range(target_length):
            decoder_output, decoder_hidden, decoder_attention = decoder(
                decoder_input, decoder_hidden, encoder_output, encoder_outputs)
            topv, topi = decoder_output.data.topk(1)
            ni = topi[0][0].item()

            decoder_input = Variable(torch.LongTensor([[ni]]))
            decoder_input = decoder_input.cuda() if use_cuda else decoder_input

            loss += criterion(decoder_output, target_variable[di])
            if ni == EOS_token:
                break

    # Backpropagation
    loss.backward()

    # Update the parameters
    encoder_optimizer.step()
    decoder_optimizer.step()

    # Return a plain float so the caller does not accumulate tensors
    return loss.item() / target_length


def showPlot(points):
    '''
    Plot the loss curve
    :param points: loss values to plot
    :return:
    '''
    plt.figure()
    fig, ax = plt.subplots()
    # this locator puts ticks at regular intervals
    loc = ticker.MultipleLocator(base=0.2)
    ax.yaxis.set_major_locator(loc)
    plt.plot(points)


def trainIters(input_lang, output_lang, pairs, encoder, decoder, n_iters, print_every=1000, plot_every=100,
               learning_rate=0.01):
    '''
    訓練過程,可以指定迭代次數(shù)，每次迭代調用 前面定義的train函數(shù)，并在迭代結束調用繪制圖像的函數(shù)
    :param input_lang: 輸入語言實例
    :param output_lang: 輸出語言實例
    :param pairs: 語料中的源語言-目標語言對
    :param encoder: 編碼器
    :param decoder: 解碼器
    :param n_iters: 迭代次數(shù)
    :param print_every: 打印loss間隔
    :param plot_every: 繪制圖像間隔
    :param learning_rate: 學習率
    :return:
    '''
    start = time.time()
    plot_losses = []
    print_loss_total = 0  # Reset every print_every
    plot_loss_total = 0  # Reset every plot_every

    encoder_optimizer = optim.SGD(encoder.parameters(), lr=learning_rate)
    decoder_optimizer = optim.SGD(decoder.parameters(), lr=learning_rate)
    training_pairs = [variablesFromPair(input_lang, output_lang, random.choice(pairs))
                      for i in range(n_iters)]
    # Loss function
    criterion = nn.NLLLoss()

    for iter in range(1, n_iters + 1):
        training_pair = training_pairs[iter - 1]
        input_variable = training_pair[0]
        target_variable = training_pair[1]

        loss = train(input_variable, target_variable, encoder,
                     decoder, encoder_optimizer, decoder_optimizer, criterion)
        print_loss_total += loss
        plot_loss_total += loss

        if iter % print_every == 0:
            print_loss_avg = print_loss_total / print_every
            print_loss_total = 0
            logger.info('%s (%d %d%%) %.4f' % (timeSince(start, iter / n_iters),
                                               iter, iter / n_iters * 100, print_loss_avg))

        if iter % plot_every == 0:
            plot_loss_avg = plot_loss_total / plot_every
            plot_losses.append(plot_loss_avg)
            plot_loss_total = 0

    showPlot(plot_losses)


def evaluateRandomly(input_lang, output_lang, pairs, encoder, decoder, n=10):
    '''
    Evaluate on sentences picked at random from the corpus.
    '''
    for i in range(n):
        pair = random.choice(pairs)
        logger.info('> %s' % pair[0])
        logger.info('= %s' % pair[1])
        output_words, attentions = evaluate(input_lang, output_lang, encoder, decoder, pair[0])
        output_sentence = ' '.join(output_words)
        logger.info('< %s' % output_sentence)
        logger.info('')


def showAttention(input_sentence, output_words, attentions):
    try:
        # Use a font that can render Chinese characters in the plot
        plt.rcParams['font.sans-serif'] = ['STSong']  # STSong, a Song typeface
        plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
        # Set up the figure with a colorbar
        fig = plt.figure()
        ax = fig.add_subplot(111)
        cax = ax.matshow(attentions.numpy(), cmap='bone')
        fig.colorbar(cax)

        # Set the x- and y-axis tick labels
        ax.set_xticklabels([''] + input_sentence.split(' ') +
                           ['<EOS>'], rotation=90)
        ax.set_yticklabels([''] + output_words)

        # Show a label at every tick
        ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
        ax.yaxis.set_major_locator(ticker.MultipleLocator(1))

        plt.show()
    except Exception as err:
        logger.error(err)


def evaluateAndShowAtten(input_lang, output_lang, input_sentence, encoder1, attn_decoder1):
    output_words, attentions = evaluate(input_lang, output_lang,
                                        encoder1, attn_decoder1, input_sentence)
    logger.info('input = %s' % input_sentence)
    logger.info('output = %s' % ' '.join(output_words))
    # Chinese input must be segmented into words before it is used as tick labels
    if input_lang.name == 'cmn':
        print(input_lang.name)
        input_sentence = cut(input_sentence)
    showAttention(input_sentence, output_words, attentions)
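
The loss bookkeeping inside trainIters (accumulate a running total, emit an average every N iterations, then reset the total) can be looked at in isolation. The following is a minimal sketch of that pattern only; the function name windowed_averages and the loss values are made up for illustration and are not part of train.py.

```python
def windowed_averages(losses, every):
    """Average consecutive, non-overlapping windows of `every` losses.

    Mirrors how trainIters adds each loss to a running total and resets
    that total whenever the iteration count is a multiple of `every`.
    """
    averages = []
    total = 0.0
    for iteration, loss in enumerate(losses, start=1):
        total += loss
        if iteration % every == 0:
            averages.append(total / every)
            total = 0.0  # reset, like print_loss_total / plot_loss_total
    return averages


# Made-up per-iteration losses, for illustration only
fake_losses = [4.0, 2.0, 3.0, 1.0, 2.0, 0.0, 5.0]
plot_points = windowed_averages(fake_losses, every=2)
print(plot_points)
```

Note that a trailing partial window (the last value above) never produces a point, which is also what happens in trainIters when n_iters is not a multiple of plot_every.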

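5.Training Process (seq2seq.py)

This module drives the whole training process: it calls the training functions defined earlier, trains on the entire corpus, and saves the resulting models to files so that they can be evaluated or reused at any time without rerunning training (the training output below shows how long that takes).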

import pickle
import sys
from io import open
from model import AttnDecoderRNN
from model import EncoderRNN
from train import *

use_cuda = torch.cuda.is_available()
logger.info("Use cuda:{}".format(use_cuda))
input = 'eng'
output = 'cmn'
# Read the target language name from the command-line arguments
if len(sys.argv) > 1:
    output = sys.argv[1]
logger.info('%s -> %s' % (input, output))

# Process the corpus
input_lang, output_lang, pairs = prepareData(input, output, True)
logger.info(random.choice(pairs))

# Check the vocabulary size of both languages
logger.info('input_lang.n_words: %d' % input_lang.n_words)
logger.info('output_lang.n_words: %d' % output_lang.n_words)

# Save the processed language data so that the evaluation script can load it
with open('./data/%s_%s_input_lang.pkl' % (input, output), "wb") as f:
    pickle.dump(input_lang, f)
with open('./data/%s_%s_output_lang.pkl' % (input, output), "wb") as f:
    pickle.dump(output_lang, f)
with open('./data/%s_%s_pairs.pkl' % (input, output), "wb") as f:
    pickle.dump(pairs, f)
logger.info('lang saved.')

# Instantiate the encoder and decoder
hidden_size = 256
encoder1 = EncoderRNN(input_lang.n_words, hidden_size)
attn_decoder1 = AttnDecoderRNN(hidden_size, output_lang.n_words,
                               1, dropout_p=0.1)
if use_cuda:
    encoder1 = encoder1.cuda()
    attn_decoder1 = attn_decoder1.cuda()

logger.info('train start. ')
# Train for 100000 iterations, logging progress every 1000 iterations
trainIters(input_lang, output_lang, pairs, encoder1, attn_decoder1, 100000, print_every=1000)
logger.info('train end. ')

# Save the encoder and decoder state dicts (torch.save accepts a file path directly)
torch.save(encoder1.state_dict(), './data/%s_%s_encoder1.stat' % (input, output))
torch.save(attn_decoder1.state_dict(), './data/%s_%s_attn_decoder1.stat' % (input, output))
logger.info('stat saved.')

# Save the whole networks
torch.save(encoder1, './data/%s_%s_encoder1.model' % (input, output))
torch.save(attn_decoder1, './data/%s_%s_attn_decoder1.model' % (input, output))
logger.info('model saved.')
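
The save-once, reload-later pattern that the script applies to the language objects can be tried on its own. The sketch below is an illustration under assumptions: it pickles a plain dict as a stand-in for the Lang object from process.py (only the name and n_words fields are taken from the code above; word2index and the file location are hypothetical), and it writes to a temporary directory rather than ./data.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a Lang object: a plain dict holding a vocabulary
words = 'he is angry with you .'.split(' ')
lang = {
    'name': 'eng',
    'word2index': {word: index for index, word in enumerate(words)},
    'n_words': len(words),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'eng_cmn_input_lang.pkl')
    # Save: the with-block closes the file handle as soon as the dump is done
    with open(path, 'wb') as f:
        pickle.dump(lang, f)
    # Load: what an evaluation script would do in a later run
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored['name'], restored['n_words'])
```

When a custom class is pickled instead of a dict, the class definition has to be importable under the same module path at load time, which is why the evaluation script needs access to the same modules as the training script.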

The training output is as follows:

C:\ProgramData\Anaconda3\python.exe E:/workspace/python/chapter7/seq2seq.py
2019-09-01 23:18:50,189 root  : INFO   Use cuda:True
2019-09-01 23:18:50,190 root  : INFO   eng -> cmn
2019-09-01 23:18:50,190 root  : INFO   Reading lines...
2019-09-01 23:18:50,470 root  : INFO   Read 19578 sentence pairs
2019-09-01 23:18:50,487 root  : INFO   Trimmed to 695 sentence pairs
2019-09-01 23:18:50,487 root  : INFO   Counting words...
2019-09-01 23:18:50,492 root  : INFO   Counted words:
2019-09-01 23:18:50,492 root  : INFO   cmn, 994
2019-09-01 23:18:50,492 root  : INFO   eng, 887
2019-09-01 23:18:50,492 root  : INFO   ['他在生你的氣。', 'he is angry with you .']
2019-09-01 23:18:50,492 root  : INFO   input_lang.n_words: 994
2019-09-01 23:18:50,492 root  : INFO   output_lang.n_words: 887
2019-09-01 23:18:50,494 root  : INFO   lang saved.
2019-09-01 23:18:53,528 root  : INFO   train start. 
2019-09-01 23:19:59,536 root  : INFO   1m 6s (- 108m 54s) (1000 1%) 3.4915
2019-09-01 23:20:49,542 root  : INFO   1m 56s (- 94m 44s) (2000 2%) 3.1642
2019-09-01 23:21:40,365 root  : INFO   2m 46s (- 89m 54s) (3000 3%) 2.8599
2019-09-01 23:22:31,133 root  : INFO   3m 37s (- 87m 2s) (4000 4%) 2.5942
2019-09-01 23:23:22,415 root  : INFO   4m 28s (- 85m 8s) (5000 5%) 2.2696
2019-09-01 23:24:13,565 root  : INFO   5m 20s (- 83m 33s) (6000 6%) 1.9124
2019-09-01 23:25:05,176 root  : INFO   6m 11s (- 82m 17s) (7000 7%) 1.5661
2019-09-01 23:25:57,465 root  : INFO   7m 3s (- 81m 15s) (8000 8%) 1.2604
2019-09-01 23:26:49,536 root  : INFO   7m 56s (- 80m 12s) (9000 9%) 0.9532
2019-09-01 23:27:41,903 root  : INFO   8m 48s (- 79m 15s) (10000 10%) 0.7092
……
2019-09-02 00:39:19,369 root  : INFO   80m 25s (- 7m 57s) (91000 91%) 0.0139
2019-09-02 00:40:12,250 root  : INFO   81m 18s (- 7m 4s) (92000 92%) 0.0123
2019-09-02 00:41:04,909 root  : INFO   82m 11s (- 6m 11s) (93000 93%) 0.0126
2019-09-02 00:41:57,523 root  : INFO   83m 3s (- 5m 18s) (94000 94%) 0.0113
2019-09-02 00:42:50,670 root  : INFO   83m 57s (- 4m 25s) (95000 95%) 0.0082
2019-09-02 00:43:43,522 root  : INFO   84m 49s (- 3m 32s) (96000 96%) 0.0123
2019-09-02 00:44:35,892 root  : INFO   85m 42s (- 2m 39s) (97000 97%) 0.0088
2019-09-02 00:45:28,415 root  : INFO   86m 34s (- 1m 46s) (98000 98%) 0.0103
2019-09-02 00:46:20,990 root  : INFO   87m 27s (- 0m 53s) (99000 99%) 0.0105
2019-09-02 00:47:13,401 root  : INFO   88m 19s (- 0m 0s) (100000 100%) 0.0102
2019-09-02 00:47:13,813 root  : INFO   train end. 
2019-09-02 00:47:13,823 root  : INFO   stat saved.
2019-09-02 00:47:13,859 root  : INFO   model saved.

Process finished with exit code 0
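
Each progress line in the log above has the form "elapsed (- remaining) (iteration percent%) average loss". The remaining time looks like a linear extrapolation from the fraction of iterations completed. The sketch below reproduces that arithmetic for the first progress line; the function names are hypothetical and the helper actually used by train.py may differ in detail.

```python
def as_minutes(seconds):
    """Format a duration in seconds as 'Xm Ys'."""
    minutes = int(seconds // 60)
    return '%dm %ds' % (minutes, int(seconds - minutes * 60))


def elapsed_and_remaining(elapsed_seconds, done, total):
    """Extrapolate the total run time linearly from the iterations done so far."""
    estimated_total = elapsed_seconds * total / done
    remaining = estimated_total - elapsed_seconds
    return '%s (- %s)' % (as_minutes(elapsed_seconds), as_minutes(remaining))


# 66 whole seconds elapsed after 1000 of 100000 iterations
first_line = elapsed_and_remaining(66, 1000, 100000)
print(first_line)  # 1m 6s (- 108m 54s)
```

Because this sketch works with whole seconds while the real run measures fractional seconds, later log lines can differ from it by a few seconds.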

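6.Evaluation Process (evaluate_eng_cmn.py)

This script evaluates the trained network. It can translate sentences picked at random from the corpus or a sentence you specify, and it visualizes the attention weights produced during translation.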

import pickle
import matplotlib.pyplot as plt
import torch
from logger import logger
from train import evaluate
from train import evaluateAndShowAtten
from train import evaluateRandomly

input = 'eng'
output = 'cmn'
logger.info('%s -> %s' % (input, output))
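# Load the processed language data
with open('./data/%s_%s_input_lang.pkl' % (input, output), "rb") as f:
    input_lang = pickle.load(f)
with open('./data/%s_%s_output_lang.pkl' % (input, output), "rb") as f:
    output_lang = pickle.load(f)
with open('./data/%s_%s_pairs.pkl' % (input, output), 'rb') as f:
    pairs = pickle.load(f)
logger.info('lang loaded.')

# Load the trained encoder and decoder (torch.load accepts a file path directly).
# Note: recent PyTorch releases default torch.load to weights_only=True, in which
# case loading a whole pickled module may need weights_only=False.
encoder1 = torch.load('./data/%s_%s_encoder1.model' % (input, output))
attn_decoder1 = torch.load('./data/%s_%s_attn_decoder1.model' % (input, output))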
logger.info('model loaded.')


# Evaluate a single sentence and plot its attention weights
def evaluateAndShowAttention(sentence):
    evaluateAndShowAtten(input_lang, output_lang, sentence, encoder1, attn_decoder1)


evaluateAndShowAttention("他們肯定會相戀的。")
evaluateAndShowAttention("我現在正在學習。")

# Evaluate on sentences picked at random from the corpus
evaluateRandomly(input_lang, output_lang, pairs, encoder1, attn_decoder1)

output_words, attentions = evaluate(input_lang, output_lang,
                                    encoder1, attn_decoder1, "我是中國人。")
plt.matshow(attentions.numpy())

The log output is as follows:

C:\ProgramData\Anaconda3\python.exe E:/workspace/python/chapter7/evaluate_cmn_eng.py
2019-09-02 00:49:48,043 root  : INFO   eng -> cmn
2019-09-02 00:49:48,044 root  : INFO   lang loaded.
2019-09-02 00:49:50,110 root  : INFO   model loaded.
2019-09-02 00:49:51,197 root  : INFO   input = 他們肯定會相戀的。
2019-09-02 00:49:51,197 root  : INFO   output = they are sure to fall in love . <EOS>
cmn
2019-09-02 00:49:51,350 root  : INFO   input = 我現在正在學習。
cmn
2019-09-02 00:49:51,350 root  : INFO   output = i am studying now . <EOS>
2019-09-02 00:49:51,461 root  : INFO   > 他可能很快就到了。
2019-09-02 00:49:51,461 root  : INFO   = he is likely to arrive soon .
2019-09-02 00:49:51,485 root  : INFO   < he is likely to arrive soon . <EOS>
2019-09-02 00:49:51,485 root  : INFO   
2019-09-02 00:49:51,485 root  : INFO   > 我熟悉這個主題。
2019-09-02 00:49:51,485 root  : INFO   = i am familiar with this subject .
2019-09-02 00:49:51,507 root  : INFO   < i am familiar with this subject . <EOS>
2019-09-02 00:49:51,507 root  : INFO   
2019-09-02 00:49:51,507 root  : INFO   > 他的年紀可以開車了。
2019-09-02 00:49:51,507 root  : INFO   = he is old enough to drive a car .
2019-09-02 00:49:51,530 root  : INFO   < he is old enough to drive a car . <EOS>
2019-09-02 00:49:51,531 root  : INFO   
2019-09-02 00:49:51,531 root  : INFO   > 我們要去市中心吃比薩。
2019-09-02 00:49:51,531 root  : INFO   = we are going downtown to eat pizza .
2019-09-02 00:49:51,552 root  : INFO   < we are going downtown to eat pizza . <EOS>
2019-09-02 00:49:51,552 root  : INFO   
2019-09-02 00:49:51,552 root  : INFO   > 她有興趣學習新的想法。
2019-09-02 00:49:51,552 root  : INFO   = she is interested in learning new ideas .
2019-09-02 00:49:51,573 root  : INFO   < she is interested in learning new ideas . <EOS>
2019-09-02 00:49:51,573 root  : INFO   
2019-09-02 00:49:51,573 root  : INFO   > 他是一位有前途的學生。
2019-09-02 00:49:51,573 root  : INFO   = he is a promising student .
2019-09-02 00:49:51,591 root  : INFO   < he is a promising student . <EOS>
2019-09-02 00:49:51,591 root  : INFO   
2019-09-02 00:49:51,591 root  : INFO   > 他今天沒上學。
2019-09-02 00:49:51,591 root  : INFO   = he is absent from school today .
2019-09-02 00:49:51,609 root  : INFO   < he is absent from school today . <EOS>
2019-09-02 00:49:51,609 root  : INFO   
2019-09-02 00:49:51,609 root  : INFO   > 我期待她的來信。
2019-09-02 00:49:51,609 root  : INFO   = i am expecting a letter from her .
2019-09-02 00:49:51,628 root  : INFO   < i am expecting a letter from her . <EOS>
2019-09-02 00:49:51,628 root  : INFO   
2019-09-02 00:49:51,629 root  : INFO   > 他很窮。
2019-09-02 00:49:51,629 root  : INFO   = he is poor .
2019-09-02 00:49:51,640 root  : INFO   < he is poor . <EOS>
2019-09-02 00:49:51,640 root  : INFO   
2019-09-02 00:49:51,640 root  : INFO   > 他擅長應付小孩子。
2019-09-02 00:49:51,640 root  : INFO   = he is good at dealing with children .
2019-09-02 00:49:51,661 root  : INFO   < he is good at dealing with children . <EOS>
2019-09-02 00:49:51,661 root  : INFO   

Process finished with exit code 0

The visualization result is as follows:

[Figure: attention visualization]

Last edited: 2019.09.26 08:07:50
