Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
Paper: http://arxiv.org/pdf/1511.06434v2.pdf
ABSTRACT
In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs) that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.
1 INTRODUCTION
Learning reusable feature representations from large unlabeled datasets has been an area of active research. In the context of computer vision, one can leverage the practically unlimited amount of unlabeled images and videos to learn good intermediate representations, which can then be used on a variety of supervised learning tasks such as image classification. We propose that one way to build good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques. One can additionally argue that their learning process and the lack of a heuristic cost function (such as pixel-wise independent mean-square error) are attractive to representation learning. GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs. There has been very limited published research in trying to understand and visualize what GANs learn, and the intermediate representations of multi-layer GANs.
In this paper, we make the following contributions:
• We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN).
• We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.
• We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.
• We show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.
2 RELATED WORK
2.1 REPRESENTATION LEARNING FROM UNLABELED DATA
Unsupervised representation learning is a fairly well studied problem in general computer vision research, as well as in the context of images. A classic approach to unsupervised representation learning is to do clustering on the data (for example using K-means), and leverage the clusters for improved classification scores. In the context of images, one can do hierarchical clustering of image patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that encode an image into a compact code, and decode the code to reconstruct the image as accurately as possible. These methods have also been shown to learn good feature representations from image pixels. Deep belief networks (Lee et al., 2009) have also been shown to work well in learning hierarchical representations.
2.2 GENERATING NATURAL IMAGES
Generative image models are well studied and fall into two categories: parametric and nonparametric.
The non-parametric models often do matching from a database of existing images, often matching patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution (Freeman et al., 2002) and in-painting (Hays & Efros, 2007).
Parametric models for generating images have been explored extensively (for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000)). However, generating natural images of the real world had not had much success until recently. A variational sampling approach to generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer from being blurry. Another approach generates images using an iterative forward diffusion process (Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) generated images suffering from being noisy and incomprehensible. A Laplacian pyramid extension to this approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects looking wobbly because of noise introduced in chaining multiple models. A recurrent network approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have also recently had some success with generating natural images. However, they have not leveraged the generators for supervised tasks.
2.3 VISUALIZING THE INTERNALS OF CNNS
One constant criticism of using neural networks has been that they are black-box methods, with little understanding of what the networks do in the form of a simple human-consumable algorithm. In the context of CNNs, Zeiler et al. (Zeiler & Fergus, 2014) showed that by using deconvolutions and filtering the maximal activations, one can find the approximate purpose of each convolution filter in the network. Similarly, using a gradient descent on the inputs lets us inspect the ideal image that activates certain subsets of filters (Mordvintsev et al.).
3 APPROACH AND MODEL ARCHITECTURE
Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to iteratively upscale low resolution generated images which can be modeled more reliably. We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature. However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models.
Core to our approach is adopting and modifying three recently demonstrated changes to CNN architectures.
The first is the all convolutional net (Springenberg et al., 2014), which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in our generator, allowing it to learn its own spatial upsampling, and in our discriminator.
Second is the trend towards eliminating fully connected layers on top of convolutional features. The strongest example of this is global average pooling which has been utilized in state of the art image classification models (Mordvintsev et al.). We found global average pooling increased model stability but hurt convergence speed. A middle ground of directly connecting the highest convolutional features to the input and output respectively of the generator and discriminator worked well. The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example model architecture.
Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance. This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models. This proved critical to get deep generators to begin learning, preventing the generator from collapsing all samples to a single point which is a common failure mode observed in GANs. Directly applying batchnorm to all layers however, resulted in sample oscillation and model instability. This was avoided by not applying batchnorm to the generator output layer and the discriminator input layer.
The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output layer which uses the Tanh function. We observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the color space of the training distribution. Within the discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which used the maxout activation (Goodfellow et al., 2013).
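To make these guidelines concrete, the sketch below renders them as a 64 × 64 generator and discriminator. This is a minimal illustration, assuming PyTorch (the paper does not prescribe a framework); the names and channel widths (`nz`, `ngf`, `ndf`) are our own choices following the architecture in Fig. 1.

```python
import torch
import torch.nn as nn

# Generator: fractionally-strided convolutions, batchnorm everywhere except
# the output layer, ReLU inside, Tanh on the output (bounded to [-1, 1]).
class Generator(nn.Module):
    def __init__(self, nz=100, ngf=64, nc=3):
        super().__init__()
        self.net = nn.Sequential(
            # Z (nz x 1 x 1) projected to a 4x4 spatial representation
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 8), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias=False),  # 8x8
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),  # 16x16
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),      # 32x32
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            # no batchnorm on the generator output layer
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False),           # 64x64
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Discriminator: strided convolutions instead of pooling, LeakyReLU(0.2),
# no batchnorm on the input layer, flattened into a single sigmoid output.
class Discriminator(nn.Module):
    def __init__(self, ndf=64, nc=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(nc, ndf, 4, 2, 1, bias=False),                    # 32x32
            nn.LeakyReLU(0.2),
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias=False),               # 16x16
            nn.BatchNorm2d(ndf * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias=False),           # 8x8
            nn.BatchNorm2d(ndf * 4), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias=False),           # 4x4
            nn.BatchNorm2d(ndf * 8), nn.LeakyReLU(0.2),
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias=False),                 # 1x1
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1)
```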
4 DETAILS OF ADVERSARIAL TRAINING
We trained DCGANs on three datasets, Large-scale Scene Understanding (LSUN) (Yu et al., 2015), Imagenet-1k and a newly assembled Faces dataset. Details on the usage of each of these datasets are given below.
No pre-processing was applied to training images besides scaling to the range of the tanh activation function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models. While previous GAN work has used momentum to accelerate training, we used the Adam optimizer (Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001 to be too high, using 0.0002 instead. Additionally, we found leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability, while reducing it to 0.5 helped stabilize training.
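The initialization and optimizer settings above translate directly into a short setup sketch, again assuming PyTorch. The BatchNorm scale initialization is a common convention from reference implementations, not something the text specifies.

```python
import torch
import torch.nn as nn

# All conv weights drawn from N(0, 0.02) as described in the text.
def weights_init(m):
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)  # convention, assumed
        nn.init.zeros_(m.bias)

G, D = Generator(), Discriminator()   # from the sketch above
G.apply(weights_init)
D.apply(weights_init)

# Adam with lr=0.0002 and beta1=0.5, as tuned in the paper.
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
batch_size = 128   # mini-batch size used in the paper
```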
Figure 1: DCGAN generator used for LSUN scene modeling. A 100-dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps. A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.
4.1 LSUN
As visual quality of samples from generative image models has improved, concerns of overfitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples. Recent analysis has shown that there is a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.
4.1.1 DEDUPLICATION
To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized RELU autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation, which has been shown to be an effective information preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic-hashing, allowing for linear time de-duplication. Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.
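A minimal sketch of this semantic-hashing step follows, reusing the 3072-128-3072 shape from the text. The dropout rate, binarization threshold, and training loss are assumptions the paper leaves unstated.

```python
import torch
import torch.nn as nn

# 3072-128-3072 denoising autoencoder whose thresholded ReLU code acts as a
# 128-bit semantic hash (32x32x3 = 3072 input dimensions).
class HashAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.drop = nn.Dropout(0.5)   # dropout regularization; rate assumed
        self.enc = nn.Sequential(nn.Linear(3072, 128), nn.ReLU())
        self.dec = nn.Linear(128, 3072)

    def forward(self, x):
        # Train with an MSE reconstruction loss (assumed).
        return self.dec(self.enc(self.drop(x)))

    def hash(self, x, threshold=0.0):
        # Binarize code activations; near-duplicates collide on equal hashes.
        with torch.no_grad():
            return (self.enc(x) > threshold).to(torch.uint8).cpu().numpy()

def dedup(codes):
    """Linear-time de-duplication: keep the first image per hash bucket."""
    seen, keep = set(), []
    for i, bits in enumerate(codes):
        key = bits.tobytes()
        if key not in seen:
            seen.add(key)
            keep.append(i)
    return keep
```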
4.2 FACES
We scraped images containing human faces from random web image queries of people's names. The people names were acquired from dbpedia, with a criterion that they were born in the modern era. This dataset has 3M images from 10K people. We run an OpenCV face detector on these images, keeping the detections that are sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images.
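A sketch of the face-box extraction step is below. The paper says only that an OpenCV face detector was used; the specific Haar cascade, the detector parameters, and the `min_size` resolution threshold are assumptions.

```python
import cv2

# Standard OpenCV Haar cascade face detector (exact detector assumed).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_crops(image_path, min_size=64):
    """Return face crops that are sufficiently high resolution."""
    img = cv2.imread(image_path)
    if img is None:
        return []
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # Keep only detections above the (assumed) resolution threshold.
    return [img[y:y + h, x:x + w] for (x, y, w, h) in boxes
            if w >= min_size and h >= min_size]
```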
Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate.
Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples such as the base boards of some of the beds.
4.3 IMAGENET-1K
We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.
5 EMPIRICAL VALIDATION OF DCGANS CAPABILITIES
5.1 CLASSIFYING CIFAR-10 USING GANS AS A FEATURE EXTRACTOR
One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.
On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm. When using a very large amount of feature maps (4800) this technique achieves 80.6% accuracy. An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates & Ng, 2011). To evaluate the quality of the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then use the discriminator's convolutional features from all layers, maxpooling each layer's representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated to form a 28672-dimensional vector, and a regularized linear L2-SVM classifier is trained on top of them. This achieves 82.8% accuracy, outperforming all K-means based approaches. The performance of DCGANs is still less than that of Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented, exemplar samples from the source dataset. Further improvements could be made by finetuning the discriminator's representations, but we leave this for future work. Additionally, since our DCGAN was never trained on CIFAR-10 this experiment also demonstrates the domain robustness of the learned features.
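The pipeline described above might look roughly like the following sketch, which pools each LeakyReLU output of the earlier Discriminator sketch to a 4 × 4 grid and fits a linear L2-SVM with scikit-learn. Which layers to tap and the SVM regularization strength `C` are assumptions.

```python
import torch
import torch.nn.functional as F
from sklearn.svm import LinearSVC

@torch.no_grad()
def extract_features(D, images):
    """Max-pool every LeakyReLU output to 4x4, flatten, and concatenate."""
    feats, x = [], images
    for layer in D.net:
        x = layer(x)
        if isinstance(layer, torch.nn.LeakyReLU):
            pooled = F.adaptive_max_pool2d(x, output_size=4)   # 4x4 spatial grid
            feats.append(pooled.flatten(start_dim=1))
    return torch.cat(feats, dim=1).cpu().numpy()

# Usage: features from the Imagenet-1k-trained discriminator, labels from CIFAR-10.
# X_train = extract_features(D, cifar_train_images)
# clf = LinearSVC(C=0.01).fit(X_train, y_train)   # C is an assumed value
```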
Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN is not pretrained on CIFAR-10, but on Imagenet-1k, and the features are used to classify CIFAR-10 images.
5.2 CLASSIFYING SVHN DIGITS USING GANS AS A FEATURE EXTRACTOR
On the StreetView House Numbers dataset (SVHN) (Netzer et al., 2011), we use the features of the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of 10,000 examples from the non-extra set and use it for all hyperparameter and model selection. 1000 uniformly class distributed training examples are randomly selected and used to train a regularized linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon another modification of CNNs designed to leverage unlabeled data (Zhao et al., 2015). Additionally, we validate that the CNN architecture used in DCGAN is not the key contributing factor of the model's performance by training a purely supervised CNN with the same architecture on the same data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio, 2012). It achieves a significantly higher 28.87% validation error.
6 INVESTIGATING AND VISUALIZING THE INTERNALS OF THE NETWORKS
We investigate the trained generators and discriminators in a variety of ways. We do not do any kind of nearest neighbor search on the training set. Nearest neighbors in pixel or feature space are trivially fooled (Theis et al., 2015) by small image transforms. We also do not use log-likelihood metrics to quantitatively assess the model, as it is a poor (Theis et al., 2015) metric.
Table 2: SVHN classification with 1000 labels
6.1 WALKING IN THE LATENT SPACE
The first experiment we did was to understand the landscape of the latent space. Walking on the manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the image generations (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations. The results are shown in Fig.4.
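A latent walk of this kind reduces to interpolating between points sampled from the prior and decoding them, as in the sketch below (a uniform prior on [-1, 1] is assumed, matching the uniform Z mentioned in Section 3).

```python
import torch

@torch.no_grad()
def latent_walk(G, steps=10, nz=100):
    """Decode a line segment between two random points in Z."""
    z0 = torch.rand(1, nz, 1, 1) * 2 - 1   # endpoints from the uniform prior
    z1 = torch.rand(1, nz, 1, 1) * 2 - 1
    alphas = torch.linspace(0, 1, steps).view(-1, 1, 1, 1)
    z = (1 - alphas) * z0 + alphas * z1    # interpolated points on the segment
    return G(z)                            # a (steps, 3, 64, 64) batch of images
```

Smooth semantic transitions along such a walk (rather than abrupt jumps between memorized images) are the sign of interest here.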
6.2 VISUALIZING THE DISCRIMINATOR FEATURES
Previous work has demonstrated that supervised training of CNNs on large image datasets results in very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting. Using guided backpropagation as proposed by (Springenberg et al., 2014), we show in Fig.5 that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows. For comparison, in the same figure, we give a baseline for randomly initialized features that are not activated on anything that is semantically relevant or interesting.
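As a rough illustration of the visualization technique, the sketch below implements guided backpropagation by clamping gradients to be positive as they flow back through the activation units, then reading off the input-space gradient of a chosen discriminator feature. The layer/channel selection interface is our own; it assumes the non-inplace activations from the earlier Discriminator sketch.

```python
import torch
import torch.nn as nn

def guided_backprop(D, image, layer_idx, channel):
    """Guided backprop (Springenberg et al., 2014) for one discriminator feature."""
    hooks = []
    for m in D.modules():
        if isinstance(m, (nn.ReLU, nn.LeakyReLU)):
            # Pass back only positive gradient signal through each unit.
            hooks.append(m.register_full_backward_hook(
                lambda mod, gin, gout: (torch.clamp(gin[0], min=0.0),)))
    x = image.clone().requires_grad_(True)
    act = x
    for i, layer in enumerate(D.net):
        act = layer(act)
        if i == layer_idx:                 # stop at the layer of interest
            break
    act[:, channel].sum().backward()       # response of the chosen feature map
    for h in hooks:
        h.remove()
    return x.grad                          # input-space visualization
```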
6.3 MANIPULATING THE GENERATOR REPRESENTATION
6.3.1 FORGETTING TO DRAW CERTAIN OBJECTS
In addition to the representations learnt by a discriminator, there is the question of what representations the generator learns. The quality of samples suggests that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. In order to explore the form that these representations take, we conducted an experiment to attempt to remove windows from the generator completely.
On 150 samples, 52 window bounding boxes were drawn manually. On the second highest convolution layer features, logistic regression was fit to predict whether a feature activation was on a window (or not), by using the criterion that activations inside the drawn bounding boxes are positives and random samples from the same images are negatives. Using this simple model, all feature maps with weights greater than zero (200 in total) were dropped from all spatial locations. Then, random new samples were generated with and without the feature map removal.
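The probe described in this paragraph could be sketched as follows: a scikit-learn logistic regression picks out the positively-weighted "window" feature maps, which are then zeroed during generation. The activation-collection step and the layer index are assumptions; only the overall procedure follows the text.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def find_window_maps(acts, labels):
    """acts: (M, C) per-location feature vectors; labels: 1 inside window boxes.
    Returns indices of positively-weighted feature maps (~200 in the paper)."""
    clf = LogisticRegression().fit(acts, labels)
    return np.where(clf.coef_[0] > 0)[0]

@torch.no_grad()
def generate_without(G, z, window_maps, layer_idx):
    """Generate while zeroing the chosen maps at all spatial locations."""
    x = z
    for i, layer in enumerate(G.net):
        x = layer(x)
        if i == layer_idx:                     # second-highest conv layer (index assumed)
            x[:, window_maps.tolist()] = 0.0   # drop the "window" feature maps
    return x
```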
The generated images with and without the window dropout are shown in Fig.6, and interestingly, the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.
Figure 4: Top rows: Interpolation between a series of 9 random points in Z shows that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, you see a room without a window slowly transforming into a room with a giant window. In the 10th row, you see what appears to be a TV slowly being transformed into a window.
6.3.2 VECTOR ARITHMETIC ON FACE SAMPLES
In the context of evaluating learned representations of words, (Mikolov et al., 2013) demonstrated that simple arithmetic operations revealed rich linear structure in representation space. One canonical example demonstrated that vector("King") - vector("Man") + vector("Woman") resulted in a vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors of sets of exemplar samples for visual concepts. Experiments working on only single samples per concept were unstable, but averaging the Z vector for three exemplars showed consistent and stable generations that semantically obeyed the arithmetic. In addition to the object manipulation shown in (Fig. 7), we demonstrate that face pose is also modeled linearly in Z space (Fig. 8).
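A sketch of this averaged vector arithmetic, including the ±0.25 uniform noise used to sample around the resulting vector Y (see Fig. 7), is below. The three-exemplar tensors are placeholders for hand-picked samples; shapes follow the earlier Generator sketch.

```python
import torch

@torch.no_grad()
def vector_arithmetic(G, z_smiling_woman, z_neutral_woman, z_neutral_man):
    """Each z_* is a (3, nz, 1, 1) stack of hand-picked exemplar Z vectors."""
    # Averaging over three exemplars stabilized the result in the paper;
    # single-sample arithmetic was unstable.
    y = (z_smiling_woman.mean(0) - z_neutral_woman.mean(0)
         + z_neutral_man.mean(0)).unsqueeze(0)   # "smiling man" vector Y
    # Uniform noise of scale +/-0.25 around Y produces the 8 nearby samples.
    noise = (torch.rand(8, *y.shape[1:]) - 0.5) * 0.5
    return G(torch.cat([y, y + noise], dim=0))   # center sample plus 8 variants
```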
These demonstrations suggest interesting applications can be developed using Z representations learned by our models. It has been previously demonstrated that conditional generative models can learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al., 2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised models. Further exploring and developing the above mentioned vector arithmetic could dramatically reduce the amount of data needed for conditional generative modeling of complex image distributions.
Figure 5: On the right, guided backpropagation visualizations of maximal axis-aligned responses for the first 6 learned convolutional features from the last convolution layer in the discriminator. Notice a significant minority of features respond to beds - the central object in the LSUN bedrooms dataset. On the left is a random filter baseline. Comparing to the previous responses there is little to no discrimination and random structure.
Figure 6: Top row: un-modified samples from model. Bottom row: the same samples generated with dropping out "window" filters. Some windows are removed, others are transformed into objects with similar visual appearance such as doors and mirrors. Although visual quality decreased, overall scene composition stayed similar, suggesting the generator has done a good job disentangling scene representation from object representation. Extended experiments could be done to remove other objects from the image and modify the objects the generator draws.
7 CONCLUSION AND FUTURE WORK
We propose a more stable set of architectures for training generative adversarial networks and we give evidence that adversarial networks learn good representations of images for supervised learning and generative modeling. There are still some forms of model instability remaining - we noticed as models are trained longer they sometimes collapse a subset of filters to a single oscillating mode.
Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors, creating a new vector Y. The center sample on the right hand side is produced by feeding Y as input to the generator. To demonstrate the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples) results in noisy overlap due to misalignment.
Further work is needed to tackle this form of instability. We think that extending this framework to other domains such as video (for frame prediction) and audio (pre-trained features for speech synthesis) should be very interesting. Further investigations into the properties of the learnt latent space would be interesting as well.
Figure 8: A "turn" vector was created from four averaged samples of faces looking left vs looking right. By adding interpolations along this axis to random samples we were able to reliably transform their pose.
Source: http://tongtianta.site/paper/351
Edited by Lornatang
Proofread by Lornatang