TextCNN Explained

Convolutional Neural Networks for Sentence Classification (Kim, 2014)

Top NLP conferences: ACL, EMNLP, NAACL (this paper appeared at EMNLP 2014)

I. Paper Overview

Abstract: uses a convolutional neural network for sentence-level text classification and achieves strong results on multiple datasets.

Introduction: using pre-trained word vectors and a convolutional neural network, the paper proposes a simple and effective text classification model.

Model: the TextCNN architecture and its regularization.

Datasets and Experimental Setup: the datasets, experimental hyperparameter settings, and experimental results.

Results and Discussion: experimental analysis, including a discussion of the number of channels and of how the word vectors are used.

Conclusion: summary of the paper.

II. Objectives

(1) TextCNN

Convolutional layer

Pooling layer

(2) Reducing overfitting

Regularization

Dropout

(3) Hyperparameter selection

Word-vector setup

Filter region size

Number of filters

Activation function

Regularization

(4) Code implementation

III. Paper Walkthrough

Development of deep learning

Development of word vectors

Development of CNNs

(1) Introduction

Development of word vectors: Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013) in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) and performing composition over the learned word vectors for classification (Collobert et al., 2011). Word vectors, wherein words are projected from a sparse, 1-of-V encoding (here V is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close—in euclidean or cosine distance—in the lower dimensional vector space.

Development of CNNs: Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features (LeCun et al., 1998). Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling (Kalchbrenner et al., 2014), and other traditional NLP tasks (Collobert et al., 2011).

In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google News, and are publicly available. We initially keep the word vectors static and learn only the other parameters of the model. Despite little tuning of hyperparameters, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are ‘universal’ feature extractors that can be utilized for various classification tasks. Learning task-specific vectors through fine-tuning results in further improvements. We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels.

Our work is philosophically similar to Razavian et al. (2014) which showed that for image classification, feature extractors obtained from a pretrained deep learning model perform well on a variety of tasks—including tasks that are very different from the original task for which the feature extractors were trained.

Fine-tuning a simple CNN on top of pre-trained word vectors is enough to achieve very good results on text classification tasks.

Task-specific word vectors obtained by fine-tuning the pre-trained vectors give further improvements.

The paper also proposes a model that uses both static pre-trained word vectors and task-specific (fine-tuned) word vectors.

Finally, the model achieves the best classification accuracy on four of the seven text classification tasks.

(2) Model

Related paper: A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification (discussed in Section IV below).

2.1 Regularization in TextCNN

1. Dropout: during training, each neuron is deactivated with a fixed probability in the forward pass, which improves the model's ability to generalize.
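As a concrete reference, below is a minimal single-channel TextCNN sketch in PyTorch (the paper's original code is not shown here); the filter sizes 3, 4, 5 with 100 feature maps each and the dropout rate of 0.5 follow the paper's settings, while the vocabulary size, sequence length, and number of classes are placeholder values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Single-channel TextCNN: embedding -> 1-D convolutions -> 1-max pooling -> dropout -> linear classifier."""

    def __init__(self, vocab_size=20000, embed_dim=300, num_classes=2,
                 filter_sizes=(3, 4, 5), num_filters=100, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One convolution per filter region size h; each produces `num_filters` feature maps.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in filter_sizes])
        self.dropout = nn.Dropout(dropout)          # dropout on the penultimate layer
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, x):                           # x: (batch, seq_len) token ids
        emb = self.embedding(x).transpose(1, 2)     # (batch, embed_dim, seq_len)
        # Convolve, apply a non-linearity, then 1-max pool over time for each region size.
        pooled = [F.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        feats = self.dropout(torch.cat(pooled, dim=1))
        return self.fc(feats)                       # logits: (batch, num_classes)

# Usage: logits for a batch of 50 padded sentences of length 40 (placeholder data).
model = TextCNN()
logits = model(torch.randint(0, 20000, (50, 40)))
```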

(3) Datasets and Experimental Setup

MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005).

SST-1: Stanford Sentiment Treebank—an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).

SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).

TREC: TREC question dataset—task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002).

CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004).

MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).

3.1 Hyperparameters and Training

Hyperparameter settings:

filter windows (h): 3, 4, 5, with 100 feature maps each

dropout rate (p): 0.5

l2 (max-norm) constraint (s): 3

mini-batch size: 50

Source: grid search on the SST-2 dev set.
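Below is a sketch of one training step under these settings, reusing the `TextCNN` class from the snippet above. Adadelta is the update rule the paper reports, mini-batches have size 50, and the l2 constraint s = 3 is implemented here as a max-norm rescaling of the final-layer weight rows after each gradient step; applying it to that layer is an assumption of this sketch.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, criterion, batch_x, batch_y, s=3.0):
    """One mini-batch update (mini-batch size 50 and Adadelta in the paper's setup)."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()
    # l2 (max-norm) constraint: after the gradient step, rescale any row of the
    # final-layer weight matrix whose l2 norm exceeds s (s = 3 in the paper).
    # Applying it to model.fc is an assumption of this sketch.
    with torch.no_grad():
        model.fc.weight.data.copy_(
            torch.renorm(model.fc.weight.data, p=2, dim=0, maxnorm=s))
    return loss.item()

# Usage sketch, reusing the TextCNN class from the earlier snippet:
# model = TextCNN()
# optimizer = torch.optim.Adadelta(model.parameters())
# loss = train_step(model, optimizer, nn.CrossEntropyLoss(), batch_x, batch_y)
```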

3.2 Pre-trained Word Vectors

We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have dimensionality of 300 and were trained using the continuous bag-of-words architecture.

Word vectors: word2vec

training corpus size: 100 billion words

data source: Google News

dimension: 300

architecture: CBOW (continuous bag-of-words)
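A sketch, assuming gensim is available, of loading these vectors and copying them into an embedding matrix; the file name and the `word2idx` vocabulary mapping are placeholders, and the U[-0.25, 0.25] range for unknown words is just a common default rather than something specified here.

```python
import numpy as np
from gensim.models import KeyedVectors

# The file name is a placeholder for the public GoogleNews word2vec binary.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def build_embedding_matrix(word2idx, dim=300):
    """Copy word2vec vectors for known words; unknown words get small random vectors."""
    matrix = np.random.uniform(-0.25, 0.25, (len(word2idx), dim)).astype(np.float32)
    for word, idx in word2idx.items():
        if word in kv:                # word is covered by the pre-trained vectors
            matrix[idx] = kv[word]
    return matrix

# The resulting matrix can be loaded into the embedding layer, e.g.:
# model.embedding.weight.data.copy_(torch.from_numpy(build_embedding_matrix(word2idx)))
```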

(4) Results and Discussion

Results of our models against other methods are listed in table 2. Our baseline model with all randomly initialized words (CNN-rand) does not perform well on its own. While we had expected performance gains through the use of pre-trained vectors, we were surprised at the magnitude of the gains. Even a simple model with static vectors (CNN-static) performs remarkably well, giving competitive results against the more sophisticated deep learning models that utilize complex pooling schemes (Kalchbrenner et al., 2014) or require parse trees to be computed beforehand (Socher et al., 2013). These results suggest that the pre-trained vectors are good, ‘universal’ feature extractors and can be utilized across datasets. Fine-tuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).

4.1 Multichannel vs. Single Channel Models

We had initially hoped that the multichannel architecture would prevent overfitting (by ensuring that the learned vectors do not deviate too far from the original values) and thus work better than the single channel model, especially on smaller datasets. The results, however, are mixed, and further work on regularizing the fine-tuning process is warranted. For instance, instead of using an additional channel for the non-static portion, one could maintain a single channel but employ extra dimensions that are allowed to be modified during training.

4.2 Static vs. Non-static Representations

As is the case with the single channel non-static model, the multichannel model is able to fine-tune the non-static channel to make it more specific to the task at hand. For example, good is most similar to bad in word2vec, presumably because they are (almost) syntactically equivalent. But for vectors in the non-static channel that were fine-tuned on the SST-2 dataset, this is no longer the case (table 3). Similarly, good is arguably closer to nice than it is to great for expressing sentiment, and this is indeed reflected in the learned vectors.

For (randomly initialized) tokens not in the set of pre-trained vectors, fine-tuning allows them to learn more meaningful representations: the network learns that exclamation marks are associated with effusive expressions and that commas are conjunctive (table 3).

4.3 Further Observations

We report on some further experiments and observations:

The improvement comes largely from greater model capacity. Kalchbrenner et al. (2014) report much worse results with a CNN that has essentially the same architecture as our single channel model. For example, their Max-TDNN (Time Delay Neural Network) with randomly initialized words obtains 37.4% on the SST-1 dataset, compared to 45.0% for our model. We attribute such discrepancy to our CNN having much more capacity (multiple filter widths and feature maps).

Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%-4% relative performance.

When randomly initializing words not in word2vec, we obtained slight improvements by sampling each dimension from U[-a, a] where a was chosen such that the randomly initialized vectors have the same variance as the pre-trained ones. It would be interesting to see if employing more sophisticated methods to mirror the distribution of pre-trained vectors in the initialization process gives further improvements.
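A minimal sketch of that initialization: since Var(U[-a, a]) = a^2 / 3, choosing a = sqrt(3 * var) makes the random vectors match the variance of the pre-trained ones (`pretrained` is assumed to be the matrix of available word2vec vectors).

```python
import numpy as np

def init_unknown_vectors(pretrained, num_unknown):
    """Sample vectors from U[-a, a] with a chosen to match the pre-trained variance."""
    var = pretrained.var()             # overall variance of the pre-trained vector entries
    a = np.sqrt(3.0 * var)             # Var(U[-a, a]) = a^2 / 3  =>  a = sqrt(3 * var)
    return np.random.uniform(-a, a, (num_unknown, pretrained.shape[1]))
```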

We briefly experimented with another set of publicly available word vectors trained by Collobert et al. (2011) on Wikipedia, and found that word2vec gave far superior performance. It is not clear whether this is due to Mikolov et al. (2013)’s architecture or the 100 billion word Google News dataset. (word2vec performs far better here, but it is unclear whether the credit belongs to the model or to the training data.)

Adadelta (Zeiler, 2012) gave similar results to Adagrad (Duchi et al., 2011) but required fewer epochs.

5 Conclusion

In the present work we have described a series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.

Key points:

Pre-trained word vectors: Word2Vec, GloVe

CNN architecture: one-dimensional convolution, pooling layer

Hyperparameter choices: filter sizes, word-vector setup

Innovations:

Proposes TextCNN, a CNN-based text classification model

Proposes several ways of setting up the word vectors

Achieves the best results on four of the text classification tasks

Extensive experiments and analysis of the hyperparameters

Takeaways:

Fine-tuning on top of pre-trained word vectors gives very good results, which suggests that the pre-trained vectors capture general-purpose features.

With pre-trained word vectors, a simple model can outperform more complex models.

For words not covered by the pre-trained vectors, fine-tuning lets them learn more meaningful representations.

IV. Hyperparameter Selection (second paper: A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification)

Embedding choice

Filter region size

Number of feature maps

Activation function

Dropout

L2 regularization

4.1 Baseline Configuration

We first consider the performance of a baseline CNN configuration. Specifically, we start with the architectural decisions and hyperparameters used in previous work (Kim, 2014) and described in Table 2. To contextualize the variance in performance attributable to various architecture decisions and hyperparameter settings, it is critical to assess the variance due strictly to the parameter estimation procedure. Most prior work, unfortunately, has not reported such variance, despite a highly stochastic learning procedure. This variance is attributable to estimation via SGD, random dropout, and random weight parameter initialization. Holding all variables (including the folds) constant, we show that the mean performance calculated via 10-fold cross validation (CV) exhibits relatively high variance over repeated runs. We replicated CV experiments 100 times for each dataset, so that each replication was a 10-fold CV, wherein the folds were fixed. We recorded the average performance for each replication and report the mean, minimum and maximum average accuracy (or AUC) values observed over 100 replications of CV (that is, we report means and ranges of averages calculated over 10-fold CV). This provides a sense of the variance we might observe without any changes to the model. We did this for both static and non-static methods. For all experiments, we used the same preprocessing steps for the data as in (Kim, 2014). For SGD, we used the ADADELTA update rule (Zeiler, 2012), and set the minibatch size to 50. We randomly selected 10% of the training data as the validation set for early stopping.
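A small sketch of this reporting scheme, with placeholder numbers: each replication's fold accuracies are averaged, and the mean, minimum, and maximum of those replication averages are what get reported.

```python
import numpy as np

# accs: hypothetical (100, 10) array -- one accuracy per replication x fold,
# with the folds held fixed across the 100 replications.
accs = np.random.rand(100, 10)              # placeholder numbers for illustration only
replication_means = accs.mean(axis=1)       # average accuracy of each 10-fold CV replication
print(replication_means.mean(),             # mean of the replication averages
      replication_means.min(),              # minimum average
      replication_means.max())              # maximum average
```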

4.2 Effect of input word vectors (embedding setup)

A nice property of sentence classification models that start with distributed representations of words as inputs is the flexibility such architectures afford to swap in different pre-trained word vectors during model initialization. Therefore, we first explore the sensitivity of CNNs for sentence classification with respect to the input representations used. Specifically, we replaced word2vec with GloVe representations. Google word2vec uses a local context window model trained on 100 billion words from Google News (Mikolov et al., 2013), while GloVe is a model based on global word-word co-occurrence statistics (Pennington et al., 2014). We used a GloVe model trained on a corpus of 840 billion tokens of web data. For both word2vec and GloVe we induce 300-dimensional word vectors. We report results achieved using GloVe representations in Table 3. Here we only report non-static GloVe results (which again uniformly outperformed the static variant).

We also experimented with concatenating word2vec and GloVe representations, thus creating 600-dimensional word vectors to be used as input to the CNN. Pre-trained vectors may not always be available for specific words (either in word2vec or GloVe, or both); in such cases, we randomly initialized the corresponding subvectors. Results are reported in the final column of Table 3.

word2vec: 300-dimensional, trained on 100 billion words

GloVe: 300-dimensional, trained on 840 billion tokens

word2vec + GloVe: 600-dimensional (concatenation of the two)
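A sketch of building the concatenated 600-dimensional inputs, assuming `w2v` and `glove` are KeyedVectors objects loaded as in the earlier snippet; when a word is missing from one source, that 300-dimensional subvector is randomly initialized, as the paper describes.

```python
import numpy as np

def concat_vector(word, w2v, glove, dim=300):
    """Return the 600-d concatenation of the word2vec and GloVe vectors for `word`."""
    parts = []
    for kv in (w2v, glove):
        if word in kv:
            parts.append(np.asarray(kv[word], dtype=np.float32))
        else:
            # Missing in this source: randomly initialize the corresponding 300-d subvector.
            parts.append(np.random.uniform(-0.25, 0.25, dim).astype(np.float32))
    return np.concatenate(parts)            # shape: (600,)
```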

4.3 Effect of filter region size

5 Conclusions

5.1 Summary of Main Empirical Findings

Even with everything held fixed, performance still fluctuates. Prior work has tended to report only the mean performance on datasets achieved by models. But this overlooks variance due solely to the stochastic inference procedure used. This can be substantial: holding everything constant (including the folds), so that variance is due exclusively to the stochastic inference procedure, we find that mean accuracy (calculated via 10 fold cross-validation) has a range of up to 1.5 points. And the range over the AUC achieved on the irony dataset is even greater – up to 3.4 points (see Table 3). More replication should be performed in future work, and ranges/variances should be reported, to prevent potentially spurious conclusions regarding relative model performance.

We find that, even when tuning them to the task at hand, the choice of input word vector representation (e.g., between word2vec and GloVe) has an impact on performance; however, different representations perform better for different tasks. At least for sentence classification, both seem to perform better than using one-hot vectors directly. We note, however, that: (1) this may not be the case if one has a sufficiently large amount of training data, and (2) the recent semi-supervised CNN model proposed by Johnson and Zhang (Johnson and Zhang, 2015) may improve performance, as compared to the simpler version of the model considered here (i.e., proposed in (Johnson and Zhang, 2014)).

The filter region size can have a large effect on performance, and should be tuned.

The number of feature maps can also play an important role in the performance, and increasing the number of feature maps will increase the training time of the model.

1-max pooling uniformly outperforms other pooling strategies.

Regularization has relatively little effect on the performance of the model.

5.2 Specific advice to practitioners

Drawing upon our empirical results, we provide the following guidance regarding CNN architecture and hyperparameters for practitioners looking to deploy CNNs for sentence classification tasks.

Consider starting with the basic configuration described in Table 2 and using non-static word2vec or GloVe rather than one-hot vectors. However, if the training dataset size is very large, it may be worthwhile to explore using one-hot vectors. Alternatively, if one has access to a large set of unlabeled in-domain data, (Johnson and Zhang, 2015) might also be an option.

Filter region size: Line-search over the single filter region size to find the ‘best’ single region size. A reasonable range might be 1 to 10. However, for datasets with very long sentences like CR, it may be worth exploring larger filter region sizes. Once this ‘best’ region size is identified, it may be worth exploring combining multiple filters using region sizes near this single best size, given that empirically multiple ‘good’ region sizes always outperformed using only the single best region size (see the sketch below).
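A sketch of this line search; `train_and_eval` is a hypothetical helper that trains a TextCNN with the given filter region sizes and returns validation accuracy.

```python
def line_search_filter_sizes(train_and_eval, candidates=range(1, 11)):
    """Line-search single filter region sizes, then try a combination around the best one."""
    # train_and_eval is a hypothetical helper: it trains a TextCNN with the given
    # filter sizes and returns validation accuracy.
    best_size = max(candidates, key=lambda h: train_and_eval(filter_sizes=[h]))
    # Combining several 'good' sizes near the best single size empirically tends to
    # outperform the single best size alone.
    combo = [max(1, best_size - 1), best_size, best_size + 1]
    return best_size, train_and_eval(filter_sizes=combo)
```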

Alter the number of feature maps for each filter region size from 100 to 600, and when this is being explored, use a small dropout rate (0.0-0.5) and a large max norm constraint. Note that increasing the number of feature maps will increase the running time, so there is a trade-off to consider. Also pay attention whether the best value found is near the border of the range (Bengio, 2012). If the best value is near 600, it may be worth trying larger values.

Activation functions: Consider different activation functions if possible: ReLU and tanh are the best overall candidates, and it might also be worth trying no activation function at all for our one-layer CNN.

Pooling: Use 1-max pooling; it does not seem necessary to expend resources evaluating alternative strategies.

Regularization: When increasing the number of feature maps begins to reduce performance, try imposing stronger regularization, e.g., a dropout rate larger than 0.5.

When assessing the performance of a model (or a particular configuration thereof), it is imperative to consider variance. Therefore, replications of the cross-fold validation procedure should be performed, and variances and ranges should be considered.

V. Research Results and Significance

(1) Research results

Achieved the best classification results on four of the seven text classification tasks.

CNN-rand: randomly initialized word vectors

CNN-static: static pre-trained word vectors

CNN-non-static: fine-tuned pre-trained word vectors

CNN-multichannel: both static pre-trained word vectors and fine-tuned pre-trained word vectors

(2) Historical significance

Opened the era of deep-learning-based text classification.

Advanced the use of convolutional neural networks in natural language processing.
