# Xception: Deep Learning with Depthwise Separable Convolutions

## Introduction

Xception builds on Inception; the figure below illustrates the Inception v3 module:

• cross-channel correlations: convolve with N kernels of size 1 × 1 × input_channels to obtain N feature maps; this step effectively computes correlations across channels.
• spatial correlations: after obtaining the N feature maps, convolve each feature map independently with a k × k kernel; this step effectively computes spatial correlations.
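The two-step factorization above can be sketched in numpy. This is an illustrative toy (function names, shapes, and the naive loops are my own, not the paper's implementation): a 1 × 1 convolution that only mixes channels, followed by an independent k × k convolution per channel.

```python
import numpy as np

def pointwise_conv(x, w):
    """1x1 convolution: mixes channels only (cross-channel correlations).
    x: (H, W, C_in), w: (C_in, C_out) -> (H, W, C_out)"""
    return x @ w

def per_channel_spatial_conv(x, kernels):
    """k x k convolution applied to each channel independently
    (spatial correlations), 'valid' padding, stride 1.
    x: (H, W, C), kernels: (C, k, k) -> (H-k+1, W-k+1, C)"""
    H, W, C = x.shape
    k = kernels.shape[1]
    out = np.zeros((H - k + 1, W - k + 1, C))
    for c in range(C):
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * kernels[c])
    return out

# 1x1 conv first (cross-channel), then a channel-wise spatial conv
x = np.random.randn(8, 8, 3)
y = pointwise_conv(x, np.random.randn(3, 16))               # (8, 8, 16)
z = per_channel_spatial_conv(y, np.random.randn(16, 3, 3))  # (6, 6, 16)
```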

assume that cross-channel correlations and spatial correlations can be mapped completely separately

The structure shown in Figure 2 first convolves with N kernels of size 1 × 1 × input_channels to obtain N feature maps (cross-channel correlations); it then convolves each feature map independently with a k × k kernel (spatial correlations).

The differences between Figure 2 and a depthwise separable convolution are:

1. The order of operations differs: a depthwise separable convolution first performs a channel-wise spatial convolution, i.e. (b) in the figure above, and then a 1 × 1 convolution; in Figure 4, the 1 × 1 convolution comes first, followed by the channel-wise spatial convolution, with the results concatenated at the end.
2. The presence of a non-linearity differs: in Figure 2, every convolution is followed by a ReLU non-linearity, whereas a depthwise separable convolution has no non-linearity between the two operations.
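Both differences can be made concrete with small numpy helpers (illustrative sketches under my own naming and shape conventions, not the paper's code): one function applies depthwise-then-pointwise with no intermediate non-linearity, the other applies pointwise first with a ReLU after each step.

```python
import numpy as np

def pw(x, w):
    """1x1 conv: (H, W, Cin) @ (Cin, Cout) -> (H, W, Cout)"""
    return x @ w

def dw(x, kernels):
    """Channel-wise k x k conv, 'valid' padding, stride 1."""
    H, W, C = x.shape
    k = kernels.shape[1]
    out = np.zeros((H - k + 1, W - k + 1, C))
    for c in range(C):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, c] = np.sum(x[i:i+k, j:j+k, c] * kernels[c])
    return out

relu = lambda t: np.maximum(t, 0)

def depthwise_separable(x, kernels, w):
    # channel-wise spatial conv first, then 1x1; no ReLU in between
    return pw(dw(x, kernels), w)

def pointwise_first(x, w, kernels):
    # 1x1 conv first; ReLU after every convolution, as in Figure 2
    return relu(dw(relu(pw(x, w)), kernels))

x = np.random.randn(8, 8, 4)
a = depthwise_separable(x, np.random.randn(4, 3, 3), np.random.randn(4, 8))
b = pointwise_first(x, np.random.randn(4, 8), np.random.randn(8, 3, 3))
# both produce (6, 6, 8) here, but differ in operation order and activations
```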

## The Xception architecture

_Here, SeparableConv refers to a depthwise separable convolution._

In short, the Xception architecture is a linear stack of depthwise separable convolution layers with residual connections.
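This one-sentence summary can be sketched as a toy numpy block (my own simplified construction: 'same' zero padding, identity skip connection, fixed channel count, no batch normalization):

```python
import numpy as np

def sep_conv(x, dw_k, pw_w):
    """Depthwise separable conv with 'same' zero padding so the
    residual addition below is shape-compatible. Illustrative only.
    x: (H, W, C), dw_k: (C, k, k), pw_w: (C, C)"""
    H, W, C = x.shape
    k = dw_k.shape[1]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[i, j, c] = np.sum(xp[i:i+k, j:j+k, c] * dw_k[c])
    return out @ pw_w  # 1x1 conv mixes channels

def xception_block(x, params):
    """A linear stack of depthwise separable convs with an identity
    residual connection, echoing the architecture summary above."""
    y = x
    for dw_k, pw_w in params:
        y = np.maximum(sep_conv(y, dw_k, pw_w), 0)  # ReLU between layers
    return x + y  # residual (skip) connection

x = np.random.randn(8, 8, 4)
params = [(np.random.randn(4, 3, 3), np.random.randn(4, 4)) for _ in range(2)]
out = xception_block(x, params)  # same shape as the input
```

In the real architecture the residual branch is a 1 × 1 convolution with striding when shapes change; the identity skip here is the simplest shape-preserving case.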

## Experiment

• 1000-class single-label classification task on the ImageNet dataset
• 17,000-class multi-label classification task on the large-scale JFT dataset.

### Classification performance

The Xception architecture shows a much larger performance improvement on the JFT dataset compared to the
ImageNet dataset. We believe this may be due to the fact that Inception V3 was developed with a focus on ImageNet and may thus be by design over-fit to this specific task. On the other hand, neither architecture was tuned for JFT. It is likely that a search for better hyperparameters for Xception on ImageNet (in particular optimization parameters and regularization parameters) would yield significant additional improvement.

### Conclusions

We showed how regular convolutions and depthwise separable convolutions lie at both extremes of a discrete spectrum,
with Inception modules being an intermediate point in between. This observation has led us to propose replacing Inception modules with depthwise separable convolutions in neural computer vision architectures. We presented a novel architecture based on this idea, named Xception, which has a similar parameter count to Inception V3. Compared to Inception V3, Xception shows small gains in classification performance on the ImageNet dataset and large gains on the JFT dataset. We expect depthwise separable convolutions to become a cornerstone of convolutional neural network architecture design in the future, since they offer similar properties to Inception modules, yet are as easy to use as regular convolution layers.