Advanced Optimization Algorithms
This section introduces more advanced optimization algorithms.
Momentum
%matplotlib inline
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
import numpy as np  # needed by get_data_ch7 below (np.genfromtxt)
import torch
eta = 0.4
def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2

def gd_2d(x1, x2, s1, s2):
    return (x1 - eta * 0.2 * x1, x2 - eta * 4 * x2, 0, 0)
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1 -0.943467, x2 -0.000073
We can see that, at the same position, the absolute value of the objective function's slope is larger in the vertical direction (the x2 axis) than in the horizontal direction (the x1 axis). Therefore, for a given learning rate, gradient descent moves the variable further in the vertical direction than in the horizontal direction at each step. We thus need a fairly small learning rate to keep the variable from overshooting the optimum in the vertical direction, but this makes progress toward the optimum in the horizontal direction slow.
Below we try a slightly larger learning rate; the variable then keeps overshooting the optimum in the vertical direction and gradually diverges.
eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(gd_2d))
epoch 20, x1 -0.387814, x2 -1673.365109
The momentum method maintains a velocity variable for each coordinate that accumulates past gradients as an exponentially weighted sum, v <- beta * v + eta * grad, and then updates x <- x - v. Implemented for the same 2D objective:
def momentum_2d(x1, x2, v1, v2):
    v1 = beta * v1 + eta * 0.2 * x1
    v2 = beta * v2 + eta * 4 * x2
    return x1 - v1, x2 - v2, v1, v2
eta, beta = 0.4, 0.5
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1 -0.062843, x2 0.001202
With momentum, the larger learning rate eta = 0.6 that previously caused divergence now converges as well:
eta = 0.6
d2l.show_trace_2d(f_2d, d2l.train_2d(momentum_2d))
epoch 20, x1 0.007188, x2 0.002553
def get_data_ch7():
    data = np.genfromtxt('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t')
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    return torch.tensor(data[:1500, :-1], dtype=torch.float32), \
           torch.tensor(data[:1500, -1], dtype=torch.float32)
features, labels = get_data_ch7()
def init_momentum_states():
    v_w = torch.zeros((features.shape[1], 1), dtype=torch.float32)
    v_b = torch.zeros(1, dtype=torch.float32)
    return (v_w, v_b)

def sgd_momentum(params, states, hyperparams):
    for p, v in zip(params, states):
        v.data = hyperparams['momentum'] * v.data + hyperparams['lr'] * p.grad.data
        p.data -= v.data
We first set the momentum hyperparameter momentum to 0.5.
d2l.train_ch7(sgd_momentum, init_momentum_states(),
{'lr': 0.02, 'momentum': 0.5}, features, labels)
loss: 0.243297, 0.057950 sec per epoch
Now we increase the momentum hyperparameter momentum to 0.9.
d2l.train_ch7(sgd_momentum, init_momentum_states(),
{'lr': 0.02, 'momentum': 0.9}, features, labels)
loss: 0.260418, 0.059441 sec per epoch
We can see that the objective value no longer changes smoothly in the later iterations. Intuitively, with momentum 0.9 the velocity is roughly a sum of the latest 1/(1-0.9) = 10 mini-batch gradients, while with momentum 0.5 it is roughly a sum of the latest 2, and a sum of 10 mini-batch gradients is 5 times larger than a sum of 2. We can therefore try reducing the learning rate to 1/5 of its previous value; the objective value then decreases for a while and changes much more smoothly.
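As a quick illustrative check of this 10-versus-2 intuition (not part of the original notebook), the geometric series sum of beta**i converges to 1 / (1 - beta):
# Illustrative check: the velocity effectively accumulates about 1 / (1 - beta) recent gradients
for beta in (0.5, 0.9):
    approx = sum(beta ** i for i in range(1000))  # geometric series, converges to 1 / (1 - beta)
    print('beta=%.1f: sum of beta^i = %.2f, 1/(1-beta) = %.1f' % (beta, approx, 1 / (1 - beta)))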
d2l.train_ch7(sgd_momentum, init_momentum_states(),
{'lr': 0.004, 'momentum': 0.9}, features, labels)
loss: 0.243650, 0.063532 sec per epoch
PyTorch Class
In PyTorch, torch.optim.SGD already implements momentum via its momentum argument.
d2l.train_pytorch_ch7(torch.optim.SGD, {'lr': 0.004, 'momentum': 0.9},
features, labels)
loss: 0.243692, 0.048604 sec per epoch
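For reference, here is a minimal, hypothetical sketch of calling torch.optim.SGD with momentum directly on a small model (illustration only; presumably similar to what d2l.train_pytorch_ch7 does internally):
# Hypothetical minimal usage of torch.optim.SGD with momentum (illustration only)
net = torch.nn.Linear(features.shape[1], 1)
optimizer = torch.optim.SGD(net.parameters(), lr=0.004, momentum=0.9)
loss_fn = torch.nn.MSELoss()
X, y = features[:10], labels[:10]   # one toy mini-batch
optimizer.zero_grad()
l = loss_fn(net(X).view(-1), y)
l.backward()
optimizer.step()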
AdaGrad
%matplotlib inline
import math
import numpy as np  # needed by get_data_ch7 below (np.genfromtxt)
import torch
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l
AdaGrad adapts the step size per coordinate by accumulating squared gradients: s <- s + g * g, then x <- x - (eta / sqrt(s + eps)) * g, so coordinates with large past gradients take smaller steps. For the same 2D objective:
def adagrad_2d(x1, x2, s1, s2):
    g1, g2, eps = 0.2 * x1, 4 * x2, 1e-6  # the first two terms are the gradients of x1 and x2
    s1 += g1 ** 2
    s2 += g2 ** 2
    x1 -= eta / math.sqrt(s1 + eps) * g1
    x2 -= eta / math.sqrt(s2 + eps) * g2
    return x1, x2, s1, s2

def f_2d(x1, x2):
    return 0.1 * x1 ** 2 + 2 * x2 ** 2
eta = 0.4
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -2.382563, x2 -0.158591
With learning rate 0.4 the trajectory is smooth, but because the accumulated squared gradients keep shrinking the effective step size, the variable barely moves in later iterations and is still far from the optimum after 20 epochs. Below we increase the learning rate to 2; the variable now approaches the optimum much more quickly.
eta = 2
d2l.show_trace_2d(f_2d, d2l.train_2d(adagrad_2d))
epoch 20, x1 -0.002295, x2 -0.000000
Implementation
Like the momentum method, AdaGrad needs to maintain a state variable with the same shape as each parameter. We implement the algorithm according to its update formulas.
def get_data_ch7():
    data = np.genfromtxt('/home/kesci/input/airfoil4755/airfoil_self_noise.dat', delimiter='\t')
    data = (data - data.mean(axis=0)) / data.std(axis=0)
    return torch.tensor(data[:1500, :-1], dtype=torch.float32), \
           torch.tensor(data[:1500, -1], dtype=torch.float32)
features, labels = get_data_ch7()
def init_adagrad_states():
    s_w = torch.zeros((features.shape[1], 1), dtype=torch.float32)
    s_b = torch.zeros(1, dtype=torch.float32)
    return (s_w, s_b)

def adagrad(params, states, hyperparams):
    eps = 1e-6
    for p, s in zip(params, states):
        s.data += (p.grad.data**2)
        p.data -= hyperparams['lr'] * p.grad.data / torch.sqrt(s + eps)
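The source stops here; as a hedged sketch, training would presumably be invoked the same way as for momentum above (the learning rate 0.1 is an assumption, not taken from the source):
# Assumed usage, following the same pattern as sgd_momentum above
d2l.train_ch7(adagrad, init_adagrad_states(), {'lr': 0.1}, features, labels)
# PyTorch also ships a built-in implementation
d2l.train_pytorch_ch7(torch.optim.Adagrad, {'lr': 0.1}, features, labels)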
word2vec
Covers the PTB dataset, the Skip-Gram model, the negative sampling approximation, and model training.
Word Embedding Basics
In the section on implementing recurrent neural networks from scratch we used one-hot vectors to represent words. Although they are easy to construct, they are usually not a good choice, mainly because one-hot vectors cannot express the similarity between different words, such as the cosine similarity we commonly use.
The Word2Vec embedding tool was proposed precisely to solve this problem. It represents each word as a fixed-length vector and pretrains these vectors on a corpus so that they capture similarity and analogy relations between words, thereby encoding some semantic information. Based on two different probabilistic assumptions, two Word2Vec models can be defined: the Skip-Gram model, which assumes that context words are generated from the center word, and the CBOW (continuous bag-of-words) model, which assumes that the center word is generated from its context words.
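To make this concrete, here is a small illustrative check (not part of the original notebook): the cosine similarity between the one-hot vectors of any two distinct words is always 0, so it carries no information about how related the words are.
import torch
# one-hot vectors for two different words in a vocabulary of size 5
x = torch.zeros(5); x[1] = 1.0
y = torch.zeros(5); y[3] = 1.0
cos = torch.dot(x, y) / (x.norm() * y.norm())
print(cos.item())  # 0.0 regardless of which two distinct words we pick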
import collections
import math
import random
import sys
import time
import os
import numpy as np
import torch
from torch import nn
import torch.utils.data as Data
The PTB Dataset
Put simply, Word2Vec learns from a corpus how to map discrete words to vectors in a continuous space while preserving their semantic relations. To train a Word2Vec model we therefore need a natural-language corpus from which the model learns the relations between words; here we use the classic PTB corpus. PTB (Penn Tree Bank) is a small, commonly used corpus sampled from Wall Street Journal articles and split into training, validation, and test sets. We will train the word embedding model on the PTB training set.
Loading the Dataset
A few lines from the training file ptb.train.txt:
aer banknote berlitz calloway centrust cluett fromstein gitano guterman ...
pierre N years old will join the board as a nonexecutive director nov. N
mr. is chairman of n.v. the dutch publishing group
...
with open('/home/kesci/input/ptb_train1020/ptb.train.txt', 'r') as f:
    lines = f.readlines()  # sentences in this dataset are separated by newlines
    raw_dataset = [st.split() for st in lines]  # st is short for sentence; tokens are separated by spaces

print('# sentences: %d' % len(raw_dataset))

# For the first 3 sentences of the dataset, print the number of tokens and the first 5 tokens.
# The end-of-sentence symbol is '<eos>', rare words are all represented as '<unk>',
# and numbers have been replaced by 'N'.
for st in raw_dataset[:3]:
    print('# tokens:', len(st), st[:5])
# sentences: 42068
# tokens: 24 ['aer', 'banknote', 'berlitz', 'calloway', 'centrust']
# tokens: 15 ['pierre', '<unk>', 'N', 'years', 'old']
# tokens: 11 ['mr.', '<unk>', 'is', 'chairman', 'of']
Building the Word Index
counter = collections.Counter([tk for st in raw_dataset for tk in st])  # tk is short for token
counter = dict(filter(lambda x: x[1] >= 5, counter.items()))  # keep only tokens that appear at least 5 times

idx_to_token = [tk for tk, _ in counter.items()]
token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
dataset = [[token_to_idx[tk] for tk in st if tk in token_to_idx]
           for st in raw_dataset]  # tokens in raw_dataset are converted to their indices here
num_tokens = sum([len(st) for st in dataset])
'# tokens: %d' % num_tokens
out:
'# tokens: 887100'
Subsampling
High-frequency words such as 'the' appear very often but contribute little to training word vectors, so each indexed word w is discarded with probability max(1 - sqrt(t / f(w)), 0), where f(w) is the word's relative frequency in the dataset and t is a threshold (1e-4 here).
def discard(idx):
    '''
    @params:
        idx: index of the word
    @return: True/False, whether to discard this word
    '''
    return random.uniform(0, 1) < 1 - math.sqrt(
        1e-4 / counter[idx_to_token[idx]] * num_tokens)
subsampled_dataset = [[tk for tk in st if not discard(tk)] for st in dataset]
print('# tokens: %d' % sum([len(st) for st in subsampled_dataset]))
def compare_counts(token):
    return '# %s: before=%d, after=%d' % (token, sum(
        [st.count(token_to_idx[token]) for st in dataset]), sum(
        [st.count(token_to_idx[token]) for st in subsampled_dataset]))
print(compare_counts('the'))
print(compare_counts('join'))
# tokens: 375995
# the: before=50770, after=2161
# join: before=45, after=45
Extracting Center Words and Context Words
def get_centers_and_contexts(dataset, max_window_size):
    '''
    @params:
        dataset: the dataset, a collection of sentences; each sentence is a collection of words
                 already converted to their integer indices
        max_window_size: maximum size of the context window
    @return:
        centers: the collection of center words
        contexts: the collection of context windows, one per center word; each window is a collection of context words
    '''
    centers, contexts = [], []
    for st in dataset:
        if len(st) < 2:  # a sentence needs at least 2 words to form a "center word - context word" pair
            continue
        centers += st
        for center_i in range(len(st)):
            window_size = random.randint(1, max_window_size)  # randomly choose the context window size
            indices = list(range(max(0, center_i - window_size),
                                 min(len(st), center_i + 1 + window_size)))
            indices.remove(center_i)  # exclude the center word from its own context
            contexts.append([st[idx] for idx in indices])
    return centers, contexts
all_centers, all_contexts = get_centers_and_contexts(subsampled_dataset, 5)
tiny_dataset = [list(range(7)), list(range(7, 10))]
print('dataset', tiny_dataset)
for center, context in zip(*get_centers_and_contexts(tiny_dataset, 2)):
    print('center', center, 'has contexts', context)
dataset [[0, 1, 2, 3, 4, 5, 6], [7, 8, 9]]
center 0 has contexts [1, 2]
center 1 has contexts [0, 2, 3]
center 2 has contexts [0, 1, 3, 4]
center 3 has contexts [2, 4]
center 4 has contexts [3, 5]
center 5 has contexts [4, 6]
center 6 has contexts [5]
center 7 has contexts [8]
center 8 has contexts [7, 9]
center 9 has contexts [7, 8]
Note: the implementation of batch data loading depends on the negative sampling approximation, so it is covered together with that part below.
Skip-Gram Model
PyTorch's Built-in Embedding Layer
embed = nn.Embedding(num_embeddings=10, embedding_dim=4)
print(embed.weight)
x = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)
print(embed(x))
Parameter containing:
tensor([[-0.7417, -1.9469, -0.5745, 1.4267],
[ 1.1483, 1.4781, 0.3064, -0.2893],
[ 0.6840, 2.4566, -0.1872, -2.2061],
[ 0.3386, 1.3820, -0.3142, 0.2427],
[ 0.4802, -0.6375, -0.4730, 1.2114],
[ 0.7130, -0.9774, 0.5321, 1.4228],
[-0.6726, -0.5829, -0.4888, -0.3290],
[ 0.3152, -0.6827, 0.9950, -0.3326],
[-1.4651, 1.2344, 1.9976, -1.5962],
[ 0.0872, 0.0130, -2.1396, -0.6361]], requires_grad=True)
tensor([[[ 1.1483, 1.4781, 0.3064, -0.2893],
[ 0.6840, 2.4566, -0.1872, -2.2061],
[ 0.3386, 1.3820, -0.3142, 0.2427]],
[[ 0.4802, -0.6375, -0.4730, 1.2114],
[ 0.7130, -0.9774, 0.5321, 1.4228],
[-0.6726, -0.5829, -0.4888, -0.3290]]], grad_fn=<EmbeddingBackward>)
PyTorch's Built-in Batch Matrix Multiplication
X = torch.ones((2, 1, 4))
Y = torch.ones((2, 4, 6))
print(torch.bmm(X, Y).shape)
torch.Size([2, 1, 6])
Forward Computation of the Skip-Gram Model
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    '''
    @params:
        center: center word indices, an integer tensor of shape (n, 1)
        contexts_and_negatives: context and noise word indices, an integer tensor of shape (n, m)
        embed_v: embedding layer for center words
        embed_u: embedding layer for context words
    @return:
        pred: inner products of the center word with the context (or noise) words,
              later used to compute the probability p(w_o | w_c)
    '''
    v = embed_v(center)                  # shape (n, 1, d)
    u = embed_u(contexts_and_negatives)  # shape (n, m, d)
    pred = torch.bmm(v, u.permute(0, 2, 1))  # bmm((n, 1, d), (n, d, m)) => shape (n, 1, m)
    return pred
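As an illustrative sanity check (not from the original notebook), the toy embed layer defined earlier can be reused for both embedding arguments; the prediction has shape (n, 1, m):
# Shape check with the 10-word embedding layer from above (illustration only)
center = torch.tensor([[1], [2]], dtype=torch.long)                       # (n, 1) = (2, 1)
contexts_and_negatives = torch.tensor([[2, 3, 4], [5, 6, 7]],
                                      dtype=torch.long)                   # (n, m) = (2, 3)
print(skip_gram(center, contexts_and_negatives, embed, embed).shape)      # torch.Size([2, 1, 3])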
Negative Sampling Approximation
def get_negatives(all_contexts, sampling_weights, K):
    '''
    @params:
        all_contexts: [[w_o1, w_o2, ...], [...], ... ]
        sampling_weights: noise-word sampling weight of each word
        K: number of noise words sampled per context word
    @return:
        all_negatives: [[w_n1, w_n2, ...], [...], ...]
    '''
    all_negatives, neg_candidates, i = [], [], 0
    population = list(range(len(sampling_weights)))
    for contexts in all_contexts:
        negatives = []
        while len(negatives) < len(contexts) * K:
            if i == len(neg_candidates):
                # Randomly draw k word indices as noise-word candidates according to each word's
                # weight (sampling_weights). For efficiency, k can be set fairly large.
                i, neg_candidates = 0, random.choices(
                    population, sampling_weights, k=int(1e5))
            neg, i = neg_candidates[i], i + 1
            # a noise word must not be a context word
            if neg not in set(contexts):
                negatives.append(neg)
        all_negatives.append(negatives)
    return all_negatives
sampling_weights = [counter[w]**0.75 for w in idx_to_token]
all_negatives = get_negatives(all_contexts, sampling_weights, 5)
Note: besides negative sampling, hierarchical softmax can also be used to reduce the computational cost; see Section 10.2.2 of the original book.
Reading Data in Batches
class MyDataset(torch.utils.data.Dataset):
    def __init__(self, centers, contexts, negatives):
        assert len(centers) == len(contexts) == len(negatives)
        self.centers = centers
        self.contexts = contexts
        self.negatives = negatives

    def __getitem__(self, index):
        return (self.centers[index], self.contexts[index], self.negatives[index])

    def __len__(self):
        return len(self.centers)

def batchify(data):
    '''
    Used as the collate_fn argument of DataLoader.
    @params:
        data: a list of length batch_size; each element is a result of __getitem__
    @outputs:
        batch: the batched tuple (centers, contexts_negatives, masks, labels)
            centers: center word indices, an integer tensor of shape (n, 1)
            contexts_negatives: context and noise word indices, an integer tensor of shape (n, m)
            masks: 0/1 integer tensor of shape (n, m) marking the non-padded positions
            labels: 0/1 integer tensor of shape (n, m) marking the context words
    '''
    max_len = max(len(c) + len(n) for _, c, n in data)
    centers, contexts_negatives, masks, labels = [], [], [], []
    for center, context, negative in data:
        cur_len = len(context) + len(negative)
        centers += [center]
        contexts_negatives += [context + negative + [0] * (max_len - cur_len)]
        masks += [[1] * cur_len + [0] * (max_len - cur_len)]  # the mask keeps padded entries from affecting the loss
        labels += [[1] * len(context) + [0] * (max_len - len(context))]
    batch = (torch.tensor(centers).view(-1, 1), torch.tensor(contexts_negatives),
             torch.tensor(masks), torch.tensor(labels))
    return batch
batch_size = 512
num_workers = 0 if sys.platform.startswith('win32') else 4
dataset = MyDataset(all_centers, all_contexts, all_negatives)
data_iter = Data.DataLoader(dataset, batch_size, shuffle=True,
                            collate_fn=batchify,
                            num_workers=num_workers)

for batch in data_iter:
    for name, data in zip(['centers', 'contexts_negatives', 'masks',
                           'labels'], batch):
        print(name, 'shape:', data.shape)
    break
centers shape: torch.Size([512, 1])
contexts_negatives shape: torch.Size([512, 60])
masks shape: torch.Size([512, 60])
labels shape: torch.Size([512, 60])
Training the Model
Loss Function
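The source ends the word2vec part here. The following is only a hedged sketch of the kind of masked binary cross-entropy loss such training typically uses, with the masks produced by batchify removing the padded positions from the loss (the class name and details are assumptions, not the original notebook's code):
class MaskedBCELoss(nn.Module):
    # binary cross-entropy over the skip_gram scores; mask zeroes out padded positions
    def forward(self, inputs, targets, mask):
        inputs, targets, mask = inputs.float(), targets.float(), mask.float()
        res = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, weight=mask, reduction='none')
        return res.sum(dim=1) / mask.sum(dim=1)  # average over non-padded positions only

loss = MaskedBCELoss()
# per batch (sketch): pred = skip_gram(center, context_negative, embed_v, embed_u)
#                     l = loss(pred.view(label.shape), label, mask).mean()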
Advanced Word Embeddings
This section introduces the GloVe model and how to use pretrained GloVe vectors to find synonyms and analogies.
import torch
import torchtext.vocab as vocab
print([key for key in vocab.pretrained_aliases.keys() if "glove" in key])
cache_dir = "/home/kesci/input/GloVe6B5429"
glove = vocab.GloVe(name='6B', dim=50, cache=cache_dir)
print("一共包含%d個詞。" % len(glove.stoi))
print(glove.stoi['beautiful'], glove.itos[3366])
['glove.42B.300d', 'glove.840B.300d', 'glove.twitter.27B.25d', 'glove.twitter.27B.50d', 'glove.twitter.27B.100d', 'glove.twitter.27B.200d', 'glove.6B.50d', 'glove.6B.100d', 'glove.6B.200d', 'glove.6B.300d']
The vocabulary contains 400000 words.
3366 beautiful
Finding Synonyms and Analogies
Finding Synonyms
Since cosine similarity in the word-vector space measures semantic similarity between words (why?), we can find a word's synonyms by searching for its k nearest neighbours in that space.
def knn(W, x, k):
    '''
    @params:
        W: the matrix of all word vectors
        x: the query vector
        k: number of neighbours to return
    @outputs:
        topk: indices of the k vectors with the highest cosine similarity
        [...]: the corresponding cosine similarities
    '''
    cos = torch.matmul(W, x.view((-1,))) / (
        (torch.sum(W * W, dim=1) + 1e-9).sqrt() * torch.sum(x * x).sqrt())
    _, topk = torch.topk(cos, k=k)
    topk = topk.cpu().numpy()
    return topk, [cos[i].item() for i in topk]

def get_similar_tokens(query_token, k, embed):
    '''
    @params:
        query_token: the query word
        k: number of synonyms to retrieve
        embed: the pretrained word vectors
    '''
    topk, cos = knn(embed.vectors,
                    embed.vectors[embed.stoi[query_token]], k+1)
    for i, c in zip(topk[1:], cos[1:]):  # skip the query word itself
        print('cosine sim=%.3f: %s' % (c, (embed.itos[i])))
get_similar_tokens('chip', 3, glove)
cosine sim=0.856: chips
cosine sim=0.749: intel
cosine sim=0.749: electronics
get_similar_tokens('baby', 3, glove)
cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl
get_similar_tokens('beautiful', 3, glove)
cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful
Finding Analogies
For an analogy "a is to b as c is to d", we search for the word whose vector is closest (in cosine similarity) to vec(b) - vec(a) + vec(c).
def get_analogy(token_a, token_b, token_c, embed):
    '''
    @params:
        token_a: word a
        token_b: word b
        token_c: word c
        embed: the pretrained word vectors
    @outputs:
        res: the analogy word d
    '''
    vecs = [embed.vectors[embed.stoi[t]]
            for t in [token_a, token_b, token_c]]
    x = vecs[1] - vecs[0] + vecs[2]
    topk, cos = knn(embed.vectors, x, 1)
    res = embed.itos[topk[0]]
    return res
get_analogy('man', 'woman', 'son', glove)
out:
'daughter'
get_analogy('beijing', 'china', 'tokyo', glove)
out:
'japan'
get_analogy('bad', 'worst', 'big', glove)
out:
'biggest'
get_analogy('do', 'did', 'go', glove)
out:
'went'