DCGANs_Paper(翻譯)

說明

區塊如下分類，原文區塊為藍底，翻譯區塊為綠底

原文

翻譯

任何的翻譯不通暢部份都請留言指導

ABSTRACT

In recent years, supervised learning with convolutional networks (CNNs) has
seen huge adoption in computer vision applications. Comparatively, unsupervised
learning with CNNs has received less attention. In this work we hope to help
bridge the gap between the success of CNNs for supervised learning and unsupervised learning.

We introduce a class of CNNs called deep convolutional generative
adversarial networks (DCGANs), that have certain architectural constraints, and
demonstrate that they are a strong candidate for unsupervised learning.

Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator.

Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.

近幾年，使用卷積網路(CNNs)的監督式學習在影像辨識應用上已被廣泛採用，相比之下，使用CNNs的非監督式學習受到的觀注較少，在這個研究項目中，我們希望可以縮短CNNs在監督式學習與非監督式學習在成功之間的距離。

我們介紹一個CNNs類別，稱為深度卷積對抗式生成網路(Deep Convolutional Generative Adversarial Networks)DCGANs，它具有某些架構約束，並說明它們是非監督式學習的強力候選人。

在各種資料集上的訓練，我們展示了令人信服的證據，證明在Generator和Discriminator中，我們的深卷積對抗學習了從對象部份到場景的層次表示。

另外，我們將學習到的特徵應用於新任務上，這說明著它們作為一般圖像表示的適用性。

1 INTRODUCTION

Learning reusable feature representations from large unlabeled datasets has been an area of active research. In the context of computer vision, one can leverage the practically unlimited amount of unlabeled images and videos to learn good intermediate representations, which can then be used on a variety of supervised learning tasks such as image classification.

從大量未標記的資料中學習可重覆應用的特徵表示一直是一個積極研究的領域，在影像辨識領域中，人們可以利用幾乎無限量的未標記圖像與影片來學習良好中間表示，這可以被應用於各種不同的監督式學習任務，如影像辨識。

We propose that one way to build good image representations is by training Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), and later reusing parts of the generator and discriminator networks as feature extractors for supervised tasks. GANs provide an attractive alternative to maximum likelihood techniques.

我們提出一個方法來建立好的影像表示，就是通過訓練對抗式生成網路(GANs)(Ian Goodfellow et al.,2014)，然後將Generator和Discriminator網路的一部分重新用於監督任務的特徵提取器，GANs替最大似然概率提供了一個吸引人的替代方案

One can additionally argue that their learning process and the lack of a heuristic cost function (such as pixel-wise independent mean-square error) are attractive to representation learning. GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs.

人們可以爭辯，他們的學習過程與缺乏啟發性的成本函數(如像素獨立均方誤差)對表示學習_{(representation learning)}是很有吸引力的。GANs是出了名的訓練不穩定，時常導致Generator產生無意義的輸出。

There has been very limited published research in trying to understand and visualize what GANs learn, and the intermediate representations of multi-layer GANs.

在試著理解與可視化GANs學到什麼以及多層GANs的中間表示方面，已發表的研究非常有限。

In this paper, we make the following contributions

We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN)
We use the trained discriminators for image classification tasks, showing competitive performance with other unsupervised algorithms.
We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.
We show that the generators have interesting vector arithmetic properties allowing for easy manipulation of many semantic qualities of generated samples.

在這篇論文中，我們做出以下貢獻：

我們提出並評估一組在卷積GANs上的架構拓樸約束，這些約束使得它們可以在大多數設置情況下穩定訓練。我們將這個架構命名為深度卷積對抗式生成網路_{(Deep Convolutional GANs)}(DCGAN)
我們使用訓練過的Discriminators在影像辨識的任務上，說明與其它非監督式學習的效能差別
我們可視化透過GANs訓練過的filter，並憑經驗表明特定的filter已學會繪製特定的對象
我們說明Generator擁有著有趣的向量演算法特性,便於操作生成的樣本的許多語義特性

2.1 REPRESENTATION LEARNING FROM UNLABELED DATA

Unsupervised representation learning is a fairly well studied problem in general computer vision research, as well as in the context of images. A classic approach to unsupervised representation learning is to do clustering on the data (for example using K-means), and leverage the clusters for improved classification scores.

非監督式表示學習在一般電腦視覺研究以及影像領域中是一個相當好研究的問題。一種經典非監督式表示學習的方法就是對資料做聚類(如K-means)，並利用聚類來提高分類得分。

In the context of images, one can do hierarchical clustering of image patches (Coates & Ng, 2012) to learn powerful image representations. Another popular method is to train auto-encoders (convolutionally, stacked (Vincent et al., 2010), separating the what and where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015)) that encode an image into a compact code, and decode the code to reconstruct the image as accurately as possible.These methods have also been shown to learn good feature representations from image pixels.Deep belief networks (Lee et al., 2009) have also been shown to work well in learning hierarchical representations.

在影像領域中，人們可以對圖像修補程式進行分層聚類(Coates & Ng,2012)以學習強大的圖像表示。另一種受歡迎的作法，就是訓練auto-encoder(convolutionally, stacked (Vincent et al., 2010), separating the what and where components of the code (Zhao et al., 2015), ladder structures (Rasmus et al., 2015))，將影像編碼為一個緊湊的代碼，並將該代碼解碼，盡可能的接近原始圖像。這種方法也被證明可以從影像像素中學習到好的表徵表示。深度信念網路(Lee et al., 2009)也被證明可以在學習結構表示上可以有好的表示

2.2 GENERATING NATURAL IMAGES

Generative image models are well studied and fall into two categories: parametric and nonparametric. The non-parametric models often do matching from a database of existing images, often matching patches of images, and have been used in texture synthesis (Efros et al., 1999), super-resolution (Freeman et al., 2002) and in-painting (Hays & Efros, 2007). Parametric models for generating images has been explored extensively (for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000)).

生成影像模型經過充份的研究並分成兩個類別：參數與非參數。非參數模型通常從既有已存在的影像中進行匹配，通常匹配影像的區塊，已被應用於紋理合成(Efros et al., 1999)，超分辨率(Freeman et al., 2002)與繪圖(Hays & Efros, 2007)。應用於生成圖像的參數模型已被廣泛探討(for example on MNIST digits or for texture synthesis (Portilla & Simoncelli, 2000))。

However, generating natural images of the real world have had not much success until recently. A variational sampling approach to generating images (Kingma & Welling, 2013) has had some success, but the samples often suffer from being blurry. Another approach generates images using an iterative forward diffusion process (Sohl-Dickstein et al., 2015). Generative Adversarial Networks (Goodfellow et al., 2014) generated images suffering from being noisy and incomprehensible. A laplacian pyramid extension to this approach (Denton et al., 2015) showed higher quality images, but they still suffered from the objects looking wobbly because of noise introduced in chaining multiple models. A recurrent network approach (Gregor et al., 2015) and a deconvolution network approach (Dosovitskiy et al., 2014) have also recently had some success with generating natural images. However, they have not leveraged the generators for supervised tasks.

然而，時至近日，生成真實世界的自然影像依然沒有太大的成功。一種變異性抽樣生成影像(Kingma & Welling, 2013)取得一些進展，但它的採樣通常是模糊不清。另一種實現影像生成的方法就是使用迭代前饋擴散過程(Sohl-Dickstein et al., 2015)。生成對抗式網路(Goodfellow et al., 2014)生成的影像受到嘈雜並難以理解。這種方法的拉普拉斯金字塔延伸(Denton et al., 2015)呈現了更高品質的照片，但因為在連結多個模型中引入的噪音，它們仍然受到物體看起來搖擺不定的影響。遞迴網路(Gregor et al., 2015)實現與反卷積(Dosovitskiy et al., 2014)實現生成自然影像也在近期取得初步成果。然而他們並沒有將Generator應用在監督式任務上。

2.3 VISUALIZING THE INTERNALS OF CNNS

One constant criticism of using neural networks has been that they are black-box methods, with little understanding of what the networks do in the form of a simple human-consumable algorithm. In the context of CNNs, Zeiler et. al. (Zeiler & Fergus, 2014) showed that by using deconvolutions and filtering the maximal activations, one can find the approximate purpose of each convolution filter in the network. Similarly, using a gradient descent on the inputs lets us inspect the ideal image that activates certain subsets of filters (Mordvintsev et al.).

人們對神經網路一直以來的批評就是它們是一個黑盒子，對神經網路在simple human-consumable的作用知之甚少。在CNNs領域中，Zeiler et. al. (Zeiler & Fergus, 2014)表明，透過反卷積與濾波器最大化啟動，人們可以大致瞭解每一個卷積濾波器在神經網路上的作用。同樣的，對輸入使用梯度下降讓我們檢查啟動某些濾波器子集的理想影像(Mordvintsev et al.)

3 APPROACH AND MODEL ARCHITECTURE

Historical attempts to scale up GANs using CNNs to model images have been unsuccessful. This motivated the authors of LAPGAN (Denton et al., 2015) to develop an alternative approach to iteratively upscale low resolution generated images which can be modeled more reliably. We also encountered difficulties attempting to scale GANs using CNN architectures commonly used in the supervised literature.

歷史上，嚐試使用CNNs去構造GANs的作法並不是很成功，這促使LAPGAN的作者(Denton et al., 2015)開發出能在迭代方法下提高解析度層次生成影象的替代方案。我們在嚐試利用CNN結構來架構GANs在監督式應用上也遇到困難。

However, after extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models. Core to our approach is adopting and modifying three recently demonstrated changes to CNN architectures.

然後，在經過一番模型摸索之後，我們定義了一系列的架構，這些架構可以在一系列的資料集中穩定訓練，並且允許訓練高解析度而且更深的生成模型。實現的核心方法就是採用並調整三個最近說明CNN架構的三種變化。

The first is the all convolutional net (Springenberg et al., 2014) which replaces deterministic spatial pooling functions (such as maxpooling) with strided convolutions, allowing the network to learn its own spatial downsampling. We use this approach in our generator, allowing it to learn its own spatial upsampling, and discriminator.

首先是all convolutional net_{(全卷積層)}(Springenberg et al., 2014)，它以步幅_(stride)卷積來替代確定性的分離池化函數(such as maxpooling)，允許網路自己學習空間的下採樣_{(downsampling)}。我們使用這種方式來實現Generator，允許網路自己學習上採樣_(upsampling)與Discriminator。

Second is the trend towards eliminating fully connected layers on top of convolutional features. The strongest example of this is global average pooling which has been utilized in state of the art image classification models (Mordvintsev et al.). We found global average pooling increased model stability but hurt convergence speed. A middle ground of directly connecting the highest convolutional features to the input and output respectively of the generator and discriminator worked well.

其次是在卷積特徵最上層消除完全連接層_{(fully connected layers)}的趨勢。最強而有力的範例就是使用全域平均池化_{(global average pooling)}，那已被應用於最先進的影像辨識模型。我們發現global average pooling增加了模型的穩定性，但犧牲了收斂的速度。折衷的作法是直接將最高層的卷積特徵與輸入連接起來，而Generator與Discriminator各自輸出，效果不錯。

The first layer of the GAN, which takes a uniform noise distribution Z as input, could be called fully connected as it is just a matrix multiplication, but the result is reshaped into a 4-dimensional tensor and used as the start of the convolution stack. For the discriminator, the last convolution layer is flattened and then fed into a single sigmoid output. See Fig. 1 for a visualization of an example model architecture.

GAN的第一層會以平均分佈的噪點Z做為輸入，這可以稱為fully connected，它只是矩陣乘法，但結果要重新調整維度至4-dimensional tensor_{(四維張量NHWC)}，並被做為卷積堆疊的起始。對Discriminator而言，最後一層卷積層平展_(flatten)之後送入Sigmoid function輸出。(See Fig. 1 for a visualization of an example model architecture. )

Third is Batch Normalization (Ioffe & Szegedy, 2015) which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance. This helps deal with training problems that arise due to poor initialization and helps gradient flow in deeper models.

第三是Batch Normalization_{(BN Layer)}(Ioffe & Szegedy, 2015)，它通過將每個神經單元的輸入標準化為零均值和單位方差來穩定學習。這有助於解決因為初始化不良而導致的訓練問題，並有助於較深模型中的梯度流動。

This proved critical to get deep generators to begin learning, preventing the generator from collapsing all samples to a single point which is a common failure mode observed in GANs. Directly applying batchnorm to all layers however, resulted in sample oscillation and model instability.

事實證明這對deep generator開始訓練是非常關鍵的，預防Generator將所有的抽樣往一個點傾斜，這是GANs上常見的失敗。然而，直接對所有的網路層做BN將導致抽樣震盪以及模型的不穩定。

個人理解上面提到的是，Generator學習過於集中在某一個Ganssian Distribution上，這會造成生成的照片多數相似。

This was avoided by not applying batchnorm to the generator output layer and the discriminator input layer. The ReLU activation (Nair & Hinton, 2010) is used in the generator with the exception of the output layer which uses the Tanh function. We observed that using a bounded activation allowed the model to learn more quickly to saturate and cover the color space of the training distribution. Within the discriminator we found the leaky rectified activation (Maas et al., 2013) (Xu et al., 2015) to work well, especially for higher resolution modeling. This is in contrast to the original GAN paper, which used the maxout activation (Goodfellow et al., 2013).

要避免在Generator的輸出層以及Discriminator的輸入層實作BN。Generator的輸出層以外使用relu，而輸出層則使用tanh做為啟動函數。我們觀察到，使用有界的啟動函數_{(bounded activation)}有助於模型迅速在訓練分佈中覆蓋顏色空間。在Discriminator中，我們發現leaky rectified activation(Maas et al., 2013) (Xu et al., 2015)效果很好，特別是針對更高解析度的建模。這是跟原始GAN論文的比較，而原始GAN中使用了maxout activation(Goodfellow et al., 2013)

Architecture guidelines for stable Deep Convolutional GANs
• Replace any pooling layers with strided convolutions (discriminator) and fractional-strided convolutions (generator).
• Use batchnorm in both the generator and the discriminator.
• Remove fully connected hidden layers for deeper architectures.
• Use ReLU activation in generator for all layers except for the output, which uses Tanh.
• Use LeakyReLU activation in the discriminator for all layers.

穩定的DCGAN架構指南：

以步幅卷積_{(strided convolutions)}取代所有的池化層(discriminator)，並使用微步卷積_{(fractional stride convolution)}(generator)
- 可參考知乎討論
- strided convolutions：較大步幅的卷積
- fractional stride convolution：轉置卷積
  - keras.layers.onv2DTranspose
在Generator與Discriminator使用BatchNorm
移除掉深度結構中的屬fully connected的隱藏層
Generator使用relu，除了最後的輸出層使用tanh
Discriminator都使用LeakyRelu

4 DETAILS OF ADVERSARIAL TRAINING

We trained DCGANs on three datasets, Large-scale Scene Understanding (LSUN) (Yu et al., 2015), Imagenet-1k and a newly assembled Faces dataset. Details on the usage of each of these datasets are given below

我們在三個資料集上訓練DCGANs，Large-scale Scene Understanding_{(大規模場景理解)}(LSUN)，Imagenet-1k以及新創的臉部資料集。下面給定每個資料的使用細節。

No pre-processing was applied to training images besides scaling to the range of the tanh activation function [-1, 1]. All models were trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128. All weights were initialized from a zero-centered Normal distribution with standard deviation 0.02. In the LeakyReLU, the slope of the leak was set to 0.2 in all models. While previous GAN work has used momentum to accelerate training, we used the Adam optimizer (Kingma & Ba, 2014) with tuned hyperparameters. We found the suggested learning rate of 0.001, to be too high, using 0.0002 instead. Additionally, we found leaving the momentum term β1 at the suggested value of 0.9 resulted in training oscillation and instability while reducing it to 0.5 helped stabilize training.

除了縮放資料至tanh activation function[-1, 1]之外，沒有對訓練照片做預處理。所有的模型皆以小批量隨機梯度下降_(SGD)訓練，batch-size設置為128。所有的權重皆以零為中心點的正態分佈初始化，標準差設置為0.02。所有模型中的LeakyRelu皆設置slope of the leak為0.2。雖然之前的GAN已經使用momentum來加速訓練，但我們使用Adam最佳化(Kingma & Ba, 2014)和調整參數。我們發現，Adam建議的學習效率0.001太高，因此使用0.0002。另外，我們發現
保持Adam動量項目 $\beta1$ 在建議值0.9的話會導致訓練振盪和不穩定，調降為0.5的時候有助於穩定訓練。

Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution Z is projected to a small spatial extent convolutional representation with many feature maps.

A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no fully connected or pooling layers are used.

Figure 1: DCGAN生成器使用使用LSUN場景資料建模。將100維均勻分佈Z投影到具有許多特徵圖的小空間範圍卷積表示。

連續的4個fractionally-strided convolutions_{(微步卷積)}(最近的論文中，這被錯誤的解讀為反卷積)，將這高階表示轉換成64x64-Pixel的影像。值得注意的是，沒有使用任何的fully connected或pooling layers

4.1 LSUN

As visual quality of samples from generative image models has improved, concerns of over-fitting and memorization of training samples have risen. To demonstrate how our model scales with more data and higher resolution generation, we train a model on the LSUN bedrooms dataset containing a little over 3 million training examples.

Recent analysis has shown that there is a direct link between how fast models learn and their generalization performance (Hardt et al., 2015). We show samples from one epoch of training (Fig.2), mimicking online learning, in addition to samples after convergence (Fig.3), as an opportunity to demonstrate that our model is not producing high quality samples via simply overfitting/memorizing training examples. No data augmentation was applied to the images.

隨著從Generator生成的照片品質不斷提高，對訓練樣本的過擬合與模型的記憶效果的擔憂也逐漸提升。為了說明我們的模型如何隨著更多資料與更高分辨率的生成而擴展，我們在LSUN超過300萬訓練樣本的臥室資料集上訓練模型。

最近的分析表明，模型學習速度和泛化效能之間存在著直接關係(Hardt et al., 2015)。我們展示訓練一次迭代的樣本(Fig.2)，模仿線上學習，加上收斂之後的採樣(Fig.3)，這是一個證明我們的模型不是通過簡單過度擬合/記憶訓練來產出高質量樣本的機會。沒有對影像實作資料增強。

4.1.1 DEDUPLICATION

To further decrease the likelihood of the generator memorizing input examples (Fig.2) we perform a simple image de-duplication process. We fit a 3072-128-3072 de-noising dropout regularized RELU autoencoder on 32x32 downsampled center-crops of training examples. The resulting code layer activations are then binarized via thresholding the ReLU activation which has been shown to be an effective information preserving technique (Srivastava et al., 2014) and provides a convenient form of semantic-hashing, allowing for linear time de-duplication. Visual inspection of hash collisions showed high precision with an estimated false positive rate of less than 1 in 100. Additionally, the technique detected and removed approximately 275,000 near duplicates, suggesting a high recall.

為了進一步降低Generator記憶輸入樣本的可能性(Fig.2)，我們執行了一個簡單的去除重覆影像的程序。我們在32x32下採樣中心區域訓練資料上訓練了一個3072-128-3072 de-noising dropout regularized RELU的自動編碼器_{(autoencoder)}。然後通過Relu做閥值處理，將得到的編碼二值化，這已經被證明是一種有效的資訊保存技術(Srivastava et al., 2014)，並提供了一種方便的語義雜湊形式，允許線性時間重復資料消除。目視檢查雜湊碰撞顯示高精度，fp_{(false postive假陽)}小於 $\dfrac{1}{100}$ 。此外，該技術檢測並移除大約275,000個接近重覆項目，這意味著高召回率_(recall)

4.2 FACES

We scraped images containing human faces from random web image queries of peoples names. The people names were acquired from dbpedia, with a criterion that they were born in the modern era. This dataset has 3M images from 10K people. We run an OpenCV face detector on these images, keeping the detections that are sufficiently high resolution, which gives us approximately 350,000 face boxes. We use these face boxes for training. No data augmentation was applied to the images

我們用人名隨機從網頁搜集包含人臉的照片。姓名是從dbpedia取得，有一個標準是，他們是出生於現在的現代人。這資料集有3,000,000張照片，來自10,000人員。我們在這些照片上用Opencv做人臉偵測，保持高解析度的檢測，這給了我們將近350,000臉部框。我們使用這些臉部框做訓練，沒有執行任何的資料增強在這些照片上。

Figure 2: Generated bedrooms after one training pass through the dataset. Theoretically, the model could learn to memorize training examples, but this is experimentally unlikely as we train with a small learning rate and minibatch SGD. We are aware of no prior empirical evidence demonstrating memorization with SGD and a small learning rate.

Figure 2: 透過資料集訓練一次迭代生成的臥室照片。理論上，模型可以學習到記憶訓練資料，但是這個實驗不大可能，因為我們用較小的學習效率以及小批量SGD來訓練模型。

Figure 3: Generated bedrooms after five epochs of training. There appears to be evidence of visual under-fitting via repeated noise textures across multiple samples such as the base boards of some of the beds.

Figure 3: 訓練迭代五次之後生成的臥室照片。似乎有證據顯示，在多個樣品(像是某些床上的基板)上通過重覆的噪點紋理而視覺欠擬合。

4.3 IMAGENET-1K

We use Imagenet-1k (Deng et al., 2009) as a source of natural images for unsupervised training. We train on 32 × 32 min-resized center crops. No data augmentation was applied to the images.

我們使用Imagenet-1k(Deng et al., 2009)做為未監督式訓練的自然影像來源。我們將影像重新調整至32x32並取中間區域。沒有在影像上執行資料增強。

5 EMPIRICAL VALIDATION OF DCGANS CAPABILITIES

5.1 CLASSIFYING CIFAR-10 USING GANS AS A FEATURE EXTRACTOR

One common technique for evaluating the quality of unsupervised representation learning algorithms is to apply them as a feature extractor on supervised datasets and evaluate the performance of linear models fitted on top of these features.

評估非監督式學習演算法的常見作法就是將它們做為特徵提取器應用在監督式資料集上，並評估擬合在這些特徵上的線性模型的效能。

On the CIFAR-10 dataset, a very strong baseline performance has been demonstrated from a well tuned single layer feature extraction pipeline utilizing K-means as a feature learning algorithm. When using a very large amount of feature maps (4800) this technique achieves 80.6% accuracy. An unsupervised multi-layered extension of the base algorithm reaches 82.0% accuracy (Coates & Ng, 2011).

在CIFAR-10資料集上，利用K-means作為特徵學習算法通過調校好的單層特徵提取管道，已經證明了非常強的基線性能。當使用非常大量的特徵圖_{(feature maps)}(4800)的時候，這個技術達到準確率80.6%。基本演算法的非監督式多層隱藏層可以達到82%的準確率。(Coates & Ng, 2011)

To evaluate the quality of the representations learned by DCGANs for supervised tasks, we train on Imagenet-1k and then use the discriminator’s convolutional features from all layers, maxpooling each layers representation to produce a 4 × 4 spatial grid. These features are then flattened and concatenated to form a 28672 dimensional vector and a regularized linear L2-SVM classifier is trained on top of them.

為了評估DCGAN對監督式的表示學習的品質，我們在Imagenet-1K訓練，並使用來自Discriminator的所有層的卷積特徵，最大池化每一層的表示，以產生4x4的空間網格。這些特徵接著被平展串接為28672維度向量，然後以正規化的L2-SVM線性分類器訓練。

This achieves 82.8% accuracy, out performing all K-means based approaches. Notably, the discriminator has many less feature maps (512 in the highest layer) compared to K-means based techniques, but does result in a larger total feature vector size due to the many layers of 4 × 4 spatial locations. The performance of DCGANs is still less than that of Exemplar CNNs (Dosovitskiy et al., 2015), a technique which trains normal discriminative CNNs in an unsupervised fashion to differentiate between specifically chosen, aggressively augmented, exemplar samples from the source dataset.

這達到82.8%的準確率，超過所有以K-means實現的方法。值得注意的是，相較K-means為基礎的技術，Discriminator具有更少的特徵圖(最高層為512)，但是由於4×4個空間位置的多個層，確實導致更大的總特徵向量大小。DCGAN的效能仍然低於Exemplar CNNs(Dosovitskiy et al., 2015)，該技術以非監督方法訓練正常判別的CNNs，以便從來源資料集區分特別選擇、積極增強的範例樣本。

Further improvements could be made by finetuning the discriminator’s representations, but we leave this for future work. Additionally, since our DCGAN was never trained on CIFAR-10 this experiment also demonstrates the domain robustness of the learned features.

可以透過調校Discriminator的表示來進一步提高效能，但我們保留到未來再作。此外，我們的DCGAN從未在CIFAR-10資料集上訓練，這個實驗也說明學習特徵的領域魯棒性。

Table 1: CIFAR-10 classification results using our pre-trained model. Our DCGAN is not pretrained on CIFAR-10, but on Imagenet-1k, and the features are used to classify CIFAR-10 images.

_{(上圖為Table 1)}
Table 1: CIFAR-10使用我們的預訓練模型分類結果。我們的DCGAN並沒有在CIFAR-10資料集上訓練過，只在Imagenet-1k，用Imagenet-1k訓練出來的特徵來分類CIFAR-10的照片

5.2 CLASSIFYING SVHN DIGITS USING GANS AS A FEATURE EXTRACTOR

On the StreetView House Numbers dataset (SVHN)(Netzer et al., 2011), we use the features of the discriminator of a DCGAN for supervised purposes when labeled data is scarce. Following similar dataset preparation rules as in the CIFAR-10 experiments, we split off a validation set of 10,000 examples from the non-extra set and use it for all hyperparameter and model selection.

在StreetView House Numbers dataset (SVHN)(Netzer et al., 2011)資料集上，當標記資料不足情況下，我們使用DCGAN的Discriminator的特徵，應用於監督式任務上。依循類別於CIFAR-10實驗的資料準備規則，我們從資料集_(非額外)取出10,000樣本，用它們來做超參數及模型的選擇。

1000 uniformly class distributed training examples are randomly selected and used to train a regularized linear L2-SVM classifier on top of the same feature extraction pipeline used for CIFAR-10. This achieves state of the art (for classification using 1000 labels) at 22.48% test error, improving upon another modifcation of CNNs designed to leverage unlabled data (Zhao et al., 2015).

Additionally, we validate that the CNN architecture used in DCGAN is not the key contributing factor of the model’s performance by training a purely supervised CNN with the same architecture on the same data and optimizing this model via random search over 64 hyperparameter trials (Bergstra & Bengio, 2012). It achieves a signficantly higher 28.87% validation error.

從類別中均勻選擇1000個隨機樣本，並應用於訓練使用與CIFAR-10所使用的相同特徵提取管理的一個正規化的線性L2-SVM分類器。錯誤率22.48%，這到達當今世上最先進_{(1000個類別的分類任務)}，利用未標記資料改進了另一種CNNs設計 (Zhao et al., 2015)

此外，我們透過在相同資料集上訓練一個相同架構的純監督式CNN，並通過隨機搜尋超過64個超參數試驗優化模型，來驗證DCGAN使用的CNN架構並不是模型效能的關鍵因素(Bergstra & Bengio, 2012)。它實現了明顯更高的28.87%驗證錯誤。

6 INVESTIGATING AND VISUALIZING THE INTERNALS OF THE NETWORKS

We investigate the trained generators and discriminators in a variety of Ways. We do not do any kind of nearest neighbor search on the training set. Nearest neighbors in pixel or feature space are trivially fooled (Theis et al., 2015) by small image transforms. We also do not use log-likelihood metrics to quantitatively assess the model, as it is a poor (Theis et al., 2015) metric.

我們以各種方式調查訓練好的Generator與Discriminator。我們並沒有在訓練資料集上做任何最近鄰搜尋。像素或特徵空間中的最近鄰域被小圖像變換(Theis et al., 2015)瑣碎地愚弄了。我們也沒有使用log-likelihood來量化評估模型，因為它是一個不好的指標(Theis et al., 2015)。

6.1 WALKING IN THE LATENT SPACE

The first experiment we did was to understand the landscape of the latent space. Walking on the manifold that is learnt can usually tell us about signs of memorization (if there are sharp transitions) and about the way in which the space is hierarchically collapsed. If walking in this latent space results in semantic changes to the image generations (such as objects being added and removed), we can reason that the model has learned relevant and interesting representations. The results are shown in Fig.4.

第一個實驗我們去瞭解潛在空間的橫向排法。走在學習的流形上可以告訴我們關於記憶的跡象(如果有突變點的話)以及關於空間階層式分解的方式。如果行走在這些潛在空間會導致影像生成的語意改變(例如物件的新增或移除)，我們可以說明模型已經學習到相關而且有趣的表示。結果如Fig.4所示。

landscape: 參考國家教育研究院-資訊與通信術語辭典

6.2 VISUALIZING THE DISCRIMINATOR FEATURES

Previous work has demonstrated that supervised training of CNNs on large image datasets results in very powerful learned features (Zeiler & Fergus, 2014). Additionally, supervised CNNs trained on scene classification learn object detectors (Oquab et al., 2014). We demonstrate that an unsupervised DCGAN trained on a large image dataset can also learn a hierarchy of features that are interesting. Using guided backpropagation as proposed by (Springenberg et al., 2014), we show in Fig.5 that the features learnt by the discriminator activate on typical parts of a bedroom, like beds and windows. For comparison, in the same figure, we give a baseline for randomly initialized features that are not activated on anything that is semantically relevant or interesting.

之前的工作已經證明，將大量資料以監督式學習-CNNs訓練可以產生非常強大的學習功能(Zeiler & Fergus, 2014)。此外，以監督式學習-CNNs訓練場景分類學習物體偵測(Oquab et al., 2014)。我們證明，非監式學習-DCGAN訓練大型資料集也可以學習到有趣的階層特徵。使用(Springenberg et al., 2014)提出的guids backpropagation_{(簡中譯導向反向傳播)}，我們在Fig.5說明，Discriminator學習到的特徵對臥室的典型部份有反應，像是床或窗戶。為了進行對比，在相同的圖上，我們給出了隨機初始化特徵的基線，這些特徵在語義相關或有趣的任何事物上都沒有沒反應。

6.3 MANIPULATING THE GENERATOR REPRESENTATION

6.3.1 FORGETTING TO DRAW CERTAIN OBJECTS

In addition to the representations learnt by a discriminator, there is the question of what representations the generator learns. The quality of samples suggest that the generator learns specific object representations for major scene components such as beds, windows, lamps, doors, and miscellaneous furniture. In order to explore the form that these representations take, we conducted an experiment to attempt to remove windows from the generator completely.

除了Discriminator學習到的表示之外，還有問題是，Generator學到什麼表示。採樣的質量說明Generator學習到主要場景物件的特定的物體表示，像是床、窗、燈、門與雜項傢俱。為了探索Generator所採用的表示形式，我們進行了一個實驗，嚐試從Generator完全的移除窗戶。

On 150 samples, 52 window bounding boxes were drawn manually. On the second highest convolution layer features, logistic regression was fit to predict whether a feature activation was on a window (or not), by using the criterion that activations inside the drawn bounding boxes are positives and random samples from the same images are negatives. Using this simple model, all feature maps with weights greater than zero ( 200 in total) were dropped from all spatial locations. Then, random new samples were generated with and without the feature map removal.

在150個樣本上，手動繪製52個邊界框。在第二高層的卷積層特徵，邏輯斯迴歸擬合來預測特徵是否對窗戶有反應(或沒有)，通過使用這個準則，啟動函數對邊界框內判定為真，而從相同照片隨機採樣得到為假。使用這個簡單模型，所有特徵圖權重大於零的部份(總計200個)皆從空間位置移除。然後，在有與沒有移除特徵圖情況下隨機生成新樣本。

The generated images with and without the window dropout are shown in Fig.6, and interestingly, the network mostly forgets to draw windows in the bedrooms, replacing them with other objects.

生成的照片，有與沒有丟棄窗戶的部份顯示於Fig.6，有趣的是，神經網路幾乎忘了在臥室繪製窗戶而是以其它物件替代。

Figure 4:
Top rows: Interpolation between a series of 9 random points in Z show that the space learned has smooth transitions, with every image in the space plausibly looking like a bedroom. In the 6th row, you see a room without a window slowly transforming into a room with a giant window. In the 10th row, you see what appears to be a TV slowly being transformed into a window.

Figure 4:
最上面一行：在Z中一系列的九個隨機點插值顯示，學習到的空間是平順轉換，每張照片看起來都像是臥室。第六行，你看到房間從沒有窗戶慢慢的轉換為有大窗戶。第十行，你看到電視慢慢的變成窗戶。

6.3.2 VECTOR ARITHMETIC ON FACE SAMPLES

In the context of evaluating learned representations of words (Mikolov et al., 2013) demonstrated that simple arithmetic operations revealed rich linear structure in representation space. One canonical example demonstrated that the vector(”King”) - vector(”Man”) + vector(”Woman”) resulted in a vector whose nearest neighbor was the vector for Queen. We investigated whether similar structure emerges in the Z representation of our generators. We performed similar arithmetic on the Z vectors of sets of exemplar samples for visual concepts. Experiments working on only single samples per concept were unstable, but averaging the Z vector for three examplars showed consistent and stable generations that semantically obeyed the arithmetic. In addition to the object manipulation shown in (Fig. 7), we demonstrate that face pose is also modeled linearly in Z space (Fig. 8).

在評估單詞學習表示背景中(Mikolov et al., 2013)證明了簡單的數學運算在表示空間中顯示豐富的線性結構。一個典型的範例說明了vector('King')-vector('Man')+vector('Woman')得到的是最近鄰居vector('Queen')。我們調查了相似的結構是否出現在我們的Generator- $Z$ 表示中。我們對視覺概念的示例樣本集的vector- $Z$ 做類似的計算。每個概念只處理單個樣本的實驗是不穩定的，但是三個樣本平均的vector- $Z$ 顯示了在語義上遵循該算法的一致且穩定的世代。除了(Fig. 7)中所示的物件操作外，我們還證明了在 $Z$ 空間中臉部姿態也是線性建模的（Fig. 8）。

These demonstrations suggest interesting applications can be developed using Z representations learned by our models. It has been previously demonstrated that conditional generative models can learn to convincingly model object attributes like scale, rotation, and position (Dosovitskiy et al., 2014). This is to our knowledge the first demonstration of this occurring in purely unsupervised models. Further exploring and developing the above mentioned vector arithmetic could dramatically reduce the amount of data needed for conditional generative modeling of complex image distributions.

這些證明表明了，可以透過我們的模型學習所得的 $Z$ 表示來開發有趣的應用程式。這已經在先前證明過，條件生成模型可以學習令人信服的建模對象屬性，如尺度、角度、位置(Dosovitskiy et al., 2014)。據我們所知，這是第一次在純非監督式模型中證明這一點。進一步的探索及開發上述的向量運算式可以明顯减少複雜影像分佈條件生成建模所需的資料量。

Figure 5: On the right, guided backpropagation visualizations of maximal axis-aligned responses for the first 6 learned convolutional features from the last convolution layer in the discriminator. Notice a significant minority of features respond to beds - the central object in the LSUN bedrooms dataset. On the left is a random filter baseline. Comparing to the previous responses there is little to no discrimination and random structure.

Figure 5: 右邊，從Discriminator的最後一層卷積層前六個卷積特徵可視化濾波器。注意到，有相當的反應對應到床-LSUN臥室資料集中心物件。左邊是隨機生成濾波器基線。與之前的反應相比，幾乎沒有區別和隨機結構。

Figure 6: Top row: un-modified samples from model. Bottom row: the same samples generated with dropping out ”window” filters. Some windows are removed, others are transformed into objects with similar visual appearance such as doors and mirrors. Although visual quality decreased, overall scene composition stayed similar, suggesting the generator has done a good job disentangling scene representation from object representation. Extended experiments could be done to remove other objects from the image and modify the objects the generator draws.

Figure 6: 上行：來自模型未經修改的樣本。下行：拿掉『窗戶』濾波器生成的相同樣本。部份窗戶被移除，其它的被轉換為其它相似物件，像是門或鏡子。雖然視覺品質降低，但整體場景組成維持相似，說明Generator在分離場景表示物件方面做得很好。可以進行延申實驗，從影像移除物件和修改Generator繪製的物件。

7 CONCLUSION AND FUTURE WORK

We propose a more stable set of architectures for training generative adversarial networks and we give evidence that adversarial networks learn good representations of images for supervised learning and generative modeling. There are still some forms of model instability remaining - we noticed as models are trained longer they sometimes collapse a subset of filters to a single oscillating mode. Further work is needed to tackle this from of instability. We think that extending this framework to other domains such as video (for frame prediction) and audio (pre-trained features for speech synthesis) should be very interesting. Further investigations into the properties of the learnt latentspace would be interesting as well.

我們提出一套更穩定的架構來訓練對抗式生成網路(GAN)，並提出證據證明，對於監督式學習和生成性建模，對抗性網絡可以學習良好的影像表示。仍然存在某些形式的模型不穩定性，我們注意到，隨著模型訓練時間愈長，它們有時會將一個子集的濾波器折疊成一個單一的振盪模式。需要進一步的努力從穩不定中解決這個問題。我們認為，延申這個框架到其它領域，像影片(用於幀預測)，音樂(用由語音合成的預訓練特徵)應該非常有趣。進一步的提查學習到的潛在空間也非常有趣。

Figure 7: Vector arithmetic for visual concepts. For each column, the Z vectors of samples are averaged. Arithmetic was then performed on the mean vectors creating a new vector Y . The center sample on the right hand side is produce by feeding Y as input to the generator. To demonstrate the interpolation capabilities of the generator, uniform noise sampled with scale ±0.25 was added to Y to produce the 8 other samples. Applying arithmetic in the input space (bottom two examples) results in noisy overlap due to misalignment.

Figure 7: 視覺概念的向量運算。對每一個欄位，vectors- $Z$ 採樣是平均的。對平均向量做計算得到新的vector- $Y$ 。右手邊的中心樣本是以 $Y$ 做輸入給Generator生成。這證明了Generator的插值能力，在 $Y$ 中加入了用±0.25採樣的均勻雜訊，以生成其它8個樣本。在輸入空間執行運算(底下兩個範例)，會導致由於未對準而產生雜訊重疊。

Figure 8: A ”turn” vector was created from four averaged samples of faces looking left vs looking right. By adding interpolations along this axis to random samples we were able to reliably transform their pose.

Figure 8: 一個轉換向量從四個臉部看左、看右的平均樣本建立。通過沿著軸向加入插值到隨機樣本上，我們可以可靠的轉換他們的姿勢。

8 SUPPLEMENTARY MATERIAL

8.1 EVALUATING DCGANS CAPABILITY TO CAPTURE DATA DISTRIBUTIONS

We propose to apply standard classification metrics to a conditional version of our model, evaluating the conditional distributions learned. We trained a DCGAN on MNIST (splitting off a 10K validation set) as well as a permutation invariant GAN baseline and evaluated the models using a nearest neighbor classifier comparing real data to a set of generated conditional samples. We found that removing the scale and bias parameters from batchnorm produced better results for both models. We speculate that the noise introduced by batchnorm helps the generative models to better explore and generate from the underlying data distribution. The results are shown in Table 3 which compares our models with other techniques. The DCGAN model achieves the same test error as a nearest neighbor classifier fitted on the training dataset - suggesting the DCGAN model has done a superb job at modeling the conditional distributions of this dataset. At one million samples per class, the DCGAN model outperforms InfiMNIST (Loosli et al., 2007), a hand developed data augmentation pipeline which uses translations and elastic deformations of training examples. The DCGAN is competitive with a probabilistic generative data augmentation technique utilizing learned per class transformations (Hauberg et al., 2015) while being more general as it directly models the data instead of transformations of the data.

我們建議將標準分類指標應用於模型條件版本，評估學習到的條件分佈。我們在MNIST上(拆分10k驗證資料集)訓練DCGAN以及排列不變的GAN基線，並且使用最近鄰分類器比較實際資料和一組生成的條件樣本來評估模型。我們發現從batchnorm中除移比例和偏差參數對兩個模型都能產生更好的結果。我們推測由batchnorm引入的噪點有助於生成模型更好的從底層數據分佈中探索和生成。結果顯示在Table3，將我們的模型與其他技術進行比較。DCGAN模型實現了與擬合於訓練數據集上的最近鄰分類器相同的測試誤差，這表明DCGAN模型在這個資料集的建模條件分佈方面做得非常出色。每個類別一百萬個樣本，DCGAN模型優於InfiMNIST(Loosli et al., 2007)，一個手工開發的資料增強管道，它使用訓練樣本的平移和彈性變形。DCGAN相較於利用類別轉換學習的概率生成資料增強技術(Hauberg et al., 2015)更通用性，因為它直接模擬資料而不是資料的轉換。

Figure 9: Side-by-side illustration of (from left-to-right) the MNIST dataset, generations from a baseline GAN, and generations from our DCGAN .

Figure 9: 並排(由左至右)MNIST資料集，由不同GAN生成照片

Figure 10: More face generations from our Face DCGAN.

Figure 10: 由Face DCGAN生成的更多臉部照片

Figure 11: Generations of a DCGAN that was trained on the Imagenet-1k dataset.

Figure 11: 以DCGAN生成照片，以Imagenet-1k資料集訓練

藤原栗子工作室

2019年7月30日星期二