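If you have heard of the different kinds of convolution in deep learning (e.g. 2D / 3D / 1x1 / transposed / dilated (atrous) / spatially separable / depthwise separable / flattened / grouped / shuffled grouped convolution) and are confused about what they actually mean, this article aims to help you understand how they actually work.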
The content of this article includes:
Convolution vs. Cross-correlation
Convolution in Deep Learning (single channel version, multi-channel version)
3D Convolution
1 x 1 Convolution
Convolution Arithmetic
Transposed Convolution (Deconvolution, checkerboard artifacts)
Dilated Convolution (Atrous Convolution)
Separable Convolution (Spatially Separable Convolution, Depthwise Convolution)
Flattened Convolution
Grouped Convolution
Shuffled Grouped Convolution
Pointwise Grouped Convolution
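Convolution is a widely used technique in signal processing, image processing, and other engineering and science fields. In deep learning, a kind of model architecture, the Convolutional Neural Network (CNN), is named after this technique. However, convolution in deep learning is essentially the cross-correlation of signal / image processing. There is a subtle difference between these two operations.
Without diving too deep into details, here is the difference. In signal / image processing, convolution is defined as:
$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$
That is, it is the integral of the product of the two functions after one of them is reversed and shifted. The following visualization demonstrates the idea.
Figure 1. Convolution in signal processing. The filter g is reversed and then slides along the horizontal axis. For every position, we calculate the area of the intersection between f and the reversed g. The intersection area is the convolution value at that specific position. Image adopted and edited from this link.
Here, function g is the filter. It is reversed and then slides along the horizontal axis. For every position, we calculate the area of the intersection between f and the reversed g. That intersection area is the convolution value at that specific position.
Cross-correlation, on the other hand, is known as the sliding dot product or sliding inner product of two functions. The filter in cross-correlation is not reversed; it slides directly through the function f. The intersection area between f and g is the cross-correlation. The plot below demonstrates the difference between convolution and cross-correlation.
Figure 2. Difference between convolution and cross-correlation in signal processing. Image adopted and edited from Wikipedia.
In deep learning, the filters in convolution are not reversed. Rigorously speaking, the operation is cross-correlation: we essentially perform element-wise multiplication and addition. It is just a convention to call it convolution in deep learning. This is fine because the weights of the filters are learned during training. If the reversed function g in the example above is the right function, then after training the learned filter will simply look like the reversed function g. Thus, there is no need to reverse the filter before training as in true convolution.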
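To make the difference concrete, here is a minimal 1D sketch in plain Python. The function names and the sample signal `f` and filter `g` are illustrative assumptions, not taken from any library; only the "valid" positions (where the filter fully overlaps the signal) are computed.

```python
def cross_correlate(f, g):
    """Slide g over f without flipping it ('valid' positions only)."""
    n = len(f) - len(g) + 1
    return [sum(f[i + j] * g[j] for j in range(len(g))) for i in range(n)]


def convolve(f, g):
    """True convolution: flip g first, then slide it over f."""
    return cross_correlate(f, g[::-1])


f = [1, 2, 3, 4, 5]      # hypothetical signal
g = [1, 0, -1]           # hypothetical filter

corr = cross_correlate(f, g)   # filter used as-is  -> [-2, -2, -2]
conv = convolve(f, g)          # filter flipped     -> [2, 2, 2]
```

The two results differ only because the filter is flipped. A network that learns its filter weights can absorb that flip into the weights themselves, which is why deep learning frameworks can skip it.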
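The purpose of convolution is to extract useful features from the input. In image processing, there is a wide range of filters one can choose for convolution. Each type of filter helps to extract a different aspect or feature of the input image, e.g. horizontal / vertical / diagonal edges. Similarly, a Convolutional Neural Network extracts different features through convolution, using filters whose weights are learned automatically during training. All these extracted features are then “combined” to make decisions.
There are a few advantages of doing convolution, such as weight sharing and translation invariance. Convolution also takes the spatial relationship of pixels into account. This is especially helpful in many computer vision tasks, since those tasks often involve identifying objects in which certain components have a certain spatial relationship with other components.
Figure 3. Convolution: the single-channel version.
In deep learning, convolution is element-wise multiplication and addition. For an image with 1 channel, the convolution is demonstrated in Figure 3. Here the filter is a 3 x 3 matrix with elements [[0, 1, 2], [2, 2, 0], [0, 1, 2]]. The filter slides through the input. At each position, it performs element-wise multiplication and addition. Each sliding position ends up with one number. The final output is a 3 x 3 matrix. (Notice that stride = 1 and padding = 0 in this example. These concepts are described in the arithmetic section below.)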
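The sliding multiply-and-add can be sketched in a few lines of plain Python. The kernel is the one quoted above; the 5 x 5 input values are an assumption chosen for illustration, since the figure's pixel values are not given in the text.

```python
def conv2d_single(image, kernel, stride=1):
    """Single-channel 'convolution' as used in deep learning (no kernel flip, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Element-wise multiplication and addition at this sliding position.
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


kernel = [[0, 1, 2],
          [2, 2, 0],
          [0, 1, 2]]

# Illustrative input values (assumed, not read from the figure).
image = [[3, 3, 2, 1, 0],
         [0, 0, 1, 3, 1],
         [3, 1, 2, 2, 3],
         [2, 0, 0, 2, 2],
         [2, 0, 0, 0, 1]]

result = conv2d_single(image, kernel)
# result == [[12, 12, 17], [10, 17, 19], [9, 6, 14]]
```

With a 5 x 5 input, a 3 x 3 kernel, stride 1 and no padding, the kernel fits at 3 positions along each axis, which is why the output is 3 x 3.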
Figure 4. Different channels emphasize different aspects of the original image. The photo was taken in Yuanyang, Yunnan, China.
Another example of multi-channel data is the layers in a Convolutional Neural Network. A convolutional-net layer usually consists of multiple channels (typically hundreds of channels). Each channel describes a different aspect of the previous layer. How do we make the transition between layers with different depths? How do we transform a layer with depth n into the following layer with depth m?
Before describing the process, we would like to clarify a few terms: layers, channels, feature maps, filters, and kernels. From a hierarchical point of view, the concepts of layers and filters are at the same level, while channels and kernels are one level below. Channels and feature maps are the same thing. A layer can have multiple channels (or feature maps): an input layer has 3 channels if the inputs are RGB images. “Channel” is usually used to describe the structure of a “layer”. Similarly, “kernel” is used to describe the structure of a “filter”.
Figure 5. Difference between “layer” (“filter”) and “channel” (“kernel”).
The difference between filter and kernel is a bit tricky. The two terms are sometimes used interchangeably, which can create confusion, but there is a subtle difference between them. A “kernel” refers to a 2D array of weights. The term “filter” refers to a 3D structure of multiple kernels stacked together. For a 2D filter, the filter is the same as the kernel. But for a 3D filter, and for most convolutions in deep learning, a filter is a collection of kernels. Each kernel is unique, emphasizing a different aspect of the input channel.
With these concepts, the multi-channel convolution goes as follows. Each kernel is applied to one input channel of the previous layer to generate one intermediate channel. This is a kernel-wise process. We repeat this process for all kernels to generate multiple channels. These channels are then summed together to form one single output channel. The following illustration should make the process clearer.
Here the input layer is a 5 x 5 x 3 matrix, with 3 channels. The filter is a 3 x 3 x 3 matrix. First, each of the three kernels in the filter is applied separately to one of the three channels in the input layer. Three convolutions are performed, which result in 3 channels of size 3 x 3.
Figure 6. The first step of 2D convolution for multi-channels: each of the kernels in the filter is applied separately to one of the three channels in the input layer. The image is adopted from this link.
Then these three channels are summed together (element-wise addition) to form one single channel (3 x 3 x 1). This channel is the result of convolving the input layer (5 x 5 x 3 matrix) with a filter (3 x 3 x 3 matrix).
Figure 7. The second step of 2D convolution for multi-channels: the three channels are summed together (element-wise addition) to form one single channel. The image is adopted from this link.
Equivalently, we can think of this process as sliding a 3D filter matrix through the input layer. Notice that the input layer and the filter have the same depth (channel number = kernel number). The 3D filter moves only in two directions, the height and width of the image (that is why such an operation is called 2D convolution, although a 3D filter is used to process 3D volumetric data). At each sliding position, we perform element-wise multiplication and addition, which results in a single number. In the example shown below, the sliding is performed at 5 positions horizontally and 5 positions vertically. Overall, we get a single output channel.
Figure 8. Another way to think about 2D convolution: sliding a 3D filter matrix through the input layer. The input layer and the filter have the same depth (channel number = kernel number), and the filter moves only along the height and width of the image. The output is a one-layer matrix.
Now we can see how to make transitions between layers with different depths. Let’s say the input layer has Din channels, and we want the output layer to have Dout channels. All we need to do is apply Dout filters to the input layer. Each filter has Din kernels and provides one output channel. After applying Dout filters, we have Dout channels, which can then be stacked together to form the output layer.
Figure 9. Standard 2D convolution. Mapping one layer with depth Din to another layer with depth Dout, by using Dout filters.
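The two steps (kernel-wise convolution, then summation) and the use of Dout filters can be sketched as follows. This is a plain-Python illustration with assumed names and toy values; the layer is stored channel-first purely for convenience.

```python
def conv2d_single(channel, kernel):
    """Valid cross-correlation of one 2D channel with one 2D kernel (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(channel) - kh + 1
    out_w = len(channel[0]) - kw + 1
    return [[sum(channel[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv2d(layer, filters):
    """layer: list of Din channels; filters: list of Dout filters, each a list of Din kernels."""
    output = []
    for filt in filters:
        # Step 1: apply each kernel to its own input channel.
        maps = [conv2d_single(ch, k) for ch, k in zip(layer, filt)]
        # Step 2: sum the per-channel maps element-wise into one output channel.
        summed = [[sum(m[r][c] for m in maps) for c in range(len(maps[0][0]))]
                  for r in range(len(maps[0]))]
        output.append(summed)
    return output


ones = [[1] * 5 for _ in range(5)]
layer = [ones, ones, ones]                                    # 5 x 5 x 3 input (Din = 3)
filter_a = [[[1] * 3 for _ in range(3)] for _ in range(3)]    # three kernels of ones
filter_b = [[[w] * 3 for _ in range(3)] for w in (1, 2, 3)]   # kernels filled with 1, 2, 3
out = conv2d(layer, [filter_a, filter_b])                     # Dout = 2 output channels
```

Each filter contributes exactly one 3 x 3 output channel, so stacking the results of the two toy filters gives a 3 x 3 x 2 output layer.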
In the last illustration of the previous section, we saw that we were actually performing convolution on a 3D volume. But typically, we still call that operation 2D convolution in deep learning. It is a 2D convolution on 3D volumetric data. The filter depth is the same as the input layer depth. The 3D filter moves only in two directions (the height and width of the image). The output of such an operation is a 2D image (with 1 channel only).
Naturally, there are 3D convolutions. They are the generalization of 2D convolution. In 3D convolution, the filter depth is smaller than the input layer depth (kernel size < channel size). As a result, the 3D filter can move in all three directions (height, width, and channel of the image). At each position, the element-wise multiplication and addition provide one number. Since the filter slides through a 3D space, the output numbers are arranged in a 3D space as well. The output is then 3D data.
Figure 10. In 3D convolution, a 3D filter can move in all three directions (height, width, and channel of the image). Since the filter slides through a 3D space, the output numbers are arranged in a 3D space as well, so the output is 3D data.
Just as 2D convolutions encode the spatial relationships of objects in a 2D domain, 3D convolutions can describe the spatial relationships of objects in 3D space. Such 3D relationships are important for some applications, such as 3D segmentation / reconstruction in biomedical imaging, e.g. CT and MRI, where objects such as blood vessels meander around in 3D space.
Since we talked about depth-wise operation in the previous section of 3D convolution, let’s look at another interesting operation, 1 x 1 convolution.
You may wonder why this is helpful. Do we just multiply a number to every number in the input layer? Yes and No. The operation is trivial for layers with only one channel. There, we multiply every element by a number.
Things become interesting if the input layer has multiple channels. The following picture illustrates how 1 x 1 convolution works for an input layer with dimension H x W x D. After 1 x 1 convolution with a filter of size 1 x 1 x D, the output channel has dimension H x W x 1. If we apply N such 1 x 1 convolutions and then concatenate the results, we get an output layer with dimension H x W x N.
Figure 11. 1 x 1 convolution, where the filter size is 1 x 1 x D.
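Seen per pixel, a 1 x 1 convolution is just a weighted sum over the D channel values, repeated for each of the N filters. A minimal sketch, with assumed names and toy values (the optional ReLU is included only to illustrate the non-linearity that typically follows):

```python
def conv1x1(layer, weights, relu=False):
    """layer[r][c] is a length-D pixel vector; weights holds N filters of length D."""
    output = []
    for row in layer:
        out_row = []
        for pixel in row:
            # One dot product across channels per filter.
            values = [sum(x * w for x, w in zip(pixel, filt)) for filt in weights]
            if relu:
                values = [max(0, v) for v in values]
            out_row.append(values)
        output.append(out_row)
    return output


layer = [[[1, 2, 3], [4, 5, 6]],
         [[7, 8, 9], [1, 0, -1]]]     # H = 2, W = 2, D = 3
weights = [[1, 0, 0],                  # filter 1: keeps the first channel
           [1, 1, 1]]                  # filter 2: sums all channels
out = conv1x1(layer, weights)          # H x W x N with N = 2
```

The spatial size (H x W) is untouched; only the depth changes from D to N.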
1 x 1 convolutions were initially proposed in the Network-in-network paper. They were then heavily used in the Google Inception paper. A few advantages of 1 x 1 convolutions are:
Dimensionality reduction for efficient computations
Efficient low dimensional embedding, or feature pooling
Applying nonlinearity again after convolution
The first two advantages can be observed in the image above. After 1 x 1 convolution, we significantly reduce the dimension depth-wise. Say the original input has 200 channels: a single 1 x 1 convolution will embed these channels (features) into a single channel. The third advantage comes in because a non-linear activation such as ReLU can be added after the 1 x 1 convolution. The non-linearity allows the network to learn more complex functions.
These advantages were described in Google’s Inception paper as:
“One big problem with the above modules, at least in this naïve form, is that even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters.
This leads to the second idea of the proposed architecture: judiciously applying dimension reductions and projections wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch… That is, 1 x 1 convolutions are used to compute reductions before the expensive 3 x 3 and 5 x 5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation which makes them dual-purpose.”
One interesting perspective regarding 1 x 1 convolution comes from Yann LeCun: “In Convolutional Nets, there is no such thing as ‘fully-connected layers’. There are only convolution layers with 1x1 convolution kernels and a full connection table.”
We now know how to deal with depth in convolution. Let’s move on to talk about how to handle the convolution in the other two directions (height & width), as well as important convolution arithmetic.
Here are a few terms:
Kernel size: the kernel is discussed in the previous section. The kernel size defines the field of view of the convolution.
Stride: it defines the step size of the kernel when sliding through the image. A stride of 1 means that the kernel slides through the image pixel by pixel. A stride of 2 means that the kernel moves 2 pixels per step (i.e., skipping 1 pixel). We can use a stride >= 2 to downsample an image.
Padding: the padding defines how the border of an image is handled. A padded convolution (‘same’ padding in TensorFlow) keeps the spatial output dimensions equal to those of the input image (at stride 1) by padding zeros around the input boundaries if necessary. An unpadded convolution (‘valid’ padding in TensorFlow), on the other hand, only performs convolution on the pixels of the input image, without adding zeros around the input boundaries. The output size is then smaller than the input size.
The following illustration describes a 2D convolution using a kernel size of 3, stride of 1 and padding of 1.
Figure 12. 2D convolution with kernel size 3, stride 1 and padding 1.
There is an excellent article about detailed arithmetic (“A guide to convolution arithmetic for deep learning”). One may refer to it for detailed descriptions and examples for different combinations of kernel size, stride, and padding. Here I just summarize results for the most general case.
For an input image of size i, kernel size k, padding p, and stride s, the output image from convolution has size o:
$$o = \left\lfloor \frac{i + 2p - k}{s} \right\rfloor + 1$$
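The general output-size relation o = floor((i + 2p - k) / s) + 1 is easy to check numerically. The helper below is a small sketch (the function name is an assumption), applied to the configurations that appear in this article:

```python
def conv_output_size(i, k, p=0, s=1):
    """Output size along one axis: o = floor((i + 2p - k) / s) + 1."""
    return (i + 2 * p - k) // s + 1


no_padding = conv_output_size(5, 3)             # 5 x 5 input, 3 x 3 kernel -> 3
same_padding = conv_output_size(5, 3, p=1)      # padding 1 keeps the size  -> 5
strided = conv_output_size(5, 3, p=1, s=2)      # stride 2 downsamples      -> 3
seven_by_seven = conv_output_size(7, 3)         # 7 x 7 input, 3 x 3 kernel -> 5
```

The same formula is applied independently to the height and the width when they differ.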
For many applications and in many network architectures, we often want to do transformations going in the opposite direction of a normal convolution, i.e. we would like to perform up-sampling. A few examples include generating high-resolution images and mapping a low-dimensional feature map to a high-dimensional space, such as in an auto-encoder or in semantic segmentation. (In the latter example, semantic segmentation first extracts feature maps in the encoder and then restores the original image size in the decoder so that it can classify every pixel in the original image.)
Traditionally, one could achieve up-sampling by applying interpolation schemes or manually creating rules. Modern architectures such as neural networks, on the other hand, tend to let the network itself learn the proper transformation automatically, without human intervention. To achieve that, we can use the transposed convolution.
The transposed convolution is also known as deconvolution or fractionally strided convolution in the literature. However, it is worth noting that the name “deconvolution” is less appropriate, since transposed convolution is not the real deconvolution as defined in signal / image processing. Technically speaking, deconvolution in signal processing reverses the convolution operation. That is not the case here. Because of that, some authors are strongly against calling transposed convolution deconvolution. People call it deconvolution mainly for simplicity. Later, we will see why calling such an operation transposed convolution is natural and more appropriate.
It is always possible to implement a transposed convolution with a direct convolution. In the example in the image below (left), we apply a transposed convolution with a 3 x 3 kernel over a 2 x 2 input padded with a 2 x 2 border of zeros, using unit strides. The up-sampled output has size 4 x 4.
Interestingly enough, one can map the same 2 x 2 input image to a different image size by applying fancy padding and stride. In the image below (right), transposed convolution is applied over the same 2 x 2 input (with 1 zero inserted between inputs) padded with a 2 x 2 border of zeros, using unit strides. Now the output has size 5 x 5.
Figure 13. Left: up-sampling a 2 x 2 input to a 4 x 4 output. Image is adopted from this link. Right: up-sampling a 2 x 2 input to a 5 x 5 output. Image is adopted from this link.
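Viewing transposed convolution in the examples above helps us build up some intuition. But to generalize its application, it is beneficial to look at how it is implemented through matrix multiplication in a computer. From there, we can also see why “transposed convolution” is an appropriate name.
In convolution, let us define C as our kernel (in matrix form), Large as the input image, and Small as the output image from the convolution. After the convolution (matrix multiplication), we down-sample the large image into a small output image. The implementation of convolution as a matrix multiplication follows C x Large = Small.
The following example shows how such an operation works. It flattens the input into a 16 x 1 matrix and transforms the kernel into a sparse matrix (4 x 16). The matrix multiplication is then applied between the sparse matrix and the flattened input. After that, the resulting matrix (4 x 1) is transformed back into a 2 x 2 output.
Figure 14. Matrix multiplication for convolution: from a Large input image (4 x 4) to a Small output image (2 x 2).
Now, if we multiply the flattened small image by the transpose of the matrix C, we have CT x Small = Large, as shown in the figure below. Here CT is a 16 x 4 matrix, so the product has the shape of the large image. Note that CT is in general not the inverse of C: the operation restores the shape and connectivity pattern of the original input, not its values.
Figure 15. Matrix multiplication for transposed convolution: from a Small input image (2 x 2) to a Large output image (4 x 4).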
As you can see here, we perform up-sampling from a small image to a large image. That is what we want to achieve. And now, you can also see where the name “transposed convolution” comes from.
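The matrix view can be reproduced with a short plain-Python sketch. The helper names and the kernel and image values are assumptions for illustration. The code builds the sparse 4 x 16 matrix C for a 3 x 3 kernel on a 4 x 4 input, down-samples with C, and then maps back up with the transpose of C.

```python
def conv_matrix(kernel, in_size):
    """Unroll a k x k kernel into a sparse (out*out) x (in*in) matrix (stride 1, no padding)."""
    k = len(kernel)
    out_size = in_size - k + 1
    rows = []
    for r in range(out_size):
        for c in range(out_size):
            row = [0] * (in_size * in_size)
            for i in range(k):
                for j in range(k):
                    row[(r + i) * in_size + (c + j)] = kernel[i][j]
            rows.append(row)
    return rows


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # hypothetical weights
large = list(range(16))                       # flattened 4 x 4 input (16 x 1)

C = conv_matrix(kernel, 4)                    # 4 x 16 sparse matrix
small = matvec(C, large)                      # 4 values, i.e. a flattened 2 x 2 output
upsampled = matvec(transpose(C), small)       # 16 values, i.e. a flattened 4 x 4 image
```

`upsampled` has the shape of the original input but not its values, which is the sense in which a transposed convolution differs from a true deconvolution.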
The general arithmetic for transposed convolution can be found from Relationship 13 and Relationship 14 in this excellent article (“A guide to convolution arithmetic for deep learning”).
One unpleasant behavior that people observe when using transposed convolution is the so-called checkerboard artifacts.
Figure 16. A few examples of checkerboard artifacts. Images are adopted from this paper.
The paper “Deconvolution and Checkerboard Artifacts” has an excellent description about this behavior. Please check out this article for more details. Here, I just summarize a few key points.
Checkerboard artifacts result from “uneven overlap” of transposed convolution. Such overlap puts more of the metaphorical paint in some places than others.
In the image below, the layer on the top is the input layer, and the layer on the bottom is the output layer after transposed convolution. During transposed convolution, a layer with small size is mapped to a layer with larger size.
In example (a), the stride is 1 and the filter size is 2. As outlined in red, the first pixel of the input maps to the first and second pixels of the output. As outlined in green, the second pixel of the input maps to the second and third pixels of the output. The second pixel of the output receives information from both the first and the second pixels of the input. Overall, the pixels in the middle portion of the output receive the same amount of information from the input. There exists a region where the kernels overlap. As the filter size is increased to 3 in example (b), the center portion that receives the most information shrinks. But this may not be a big deal, since the overlap is still even: the pixels in the center portion of the output receive the same amount of information from the input.
Figure 17. The image is adopted and modified from the paper (link).
Now, for the example below, we change the stride to 2. In example (a), where the filter size is 2, all pixels of the output receive the same amount of information from the input. They all receive information from a single pixel of the input. There is no overlap in the transposed convolution here.
Figure 18. The image is adopted and modified from the paper (link).
If we change the filter size to 4 in example (b), the evenly overlapped region shrinks. But still, one can use the center portion of the output as the valid output, where each pixel receives the same amount of information from the input.
However, things become interesting if we change the filter size to 3 and 5 in examples (c) and (d). In these two cases, every pixel of the output receives a different amount of information compared to its adjacent pixels. One cannot find a continuous and evenly overlapped region in the output.
The transposed convolution has uneven overlap when the filter size is not divisible by the stride. This “uneven overlap” puts more of the paint in some places than others, and thus creates the checkerboard effect. In fact, the unevenly overlapped region tends to be more extreme in two dimensions: there, the two patterns are multiplied together, so the unevenness gets squared.
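The overlap pattern can be counted directly. The sketch below (function name and the choice of 4 input pixels are assumptions) records, for a 1D transposed convolution, how many input pixels contribute to each output pixel:

```python
def overlap_counts(n_in, kernel_size, stride):
    """Number of input pixels that write into each output pixel of a 1D transposed convolution."""
    out_len = (n_in - 1) * stride + kernel_size
    counts = [0] * out_len
    for i in range(n_in):
        # Input pixel i is spread over kernel_size consecutive outputs starting at i * stride.
        for j in range(kernel_size):
            counts[i * stride + j] += 1
    return counts


even_a = overlap_counts(4, kernel_size=2, stride=2)   # [1, 1, 1, 1, 1, 1, 1, 1]
even_b = overlap_counts(4, kernel_size=4, stride=2)   # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
uneven_c = overlap_counts(4, kernel_size=3, stride=2) # [1, 1, 2, 1, 2, 1, 2, 1, 1]
uneven_d = overlap_counts(4, kernel_size=5, stride=2) # [1, 1, 2, 2, 3, 2, 3, 2, 2, 1, 1]
```

With filter sizes 2 and 4 (divisible by the stride of 2) the interior counts are constant, while sizes 3 and 5 give the alternating pattern that shows up as a checkerboard in 2D.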
There are two things one can do to reduce such artifacts while applying transposed convolution. First, make sure you use a filter size that is divisible by your stride, avoiding the overlap issue. Second, one can use transposed convolution with stride = 1, which helps to reduce the checkerboard effect. However, artifacts can still leak through, as seen in many recent models.
The paper further proposes a better up-sampling approach: resize the image first (using nearest-neighbor interpolation or bilinear interpolation) and then apply a convolutional layer. By doing that, the authors avoid the checkerboard effect. You may want to try it for your applications.
Dilated convolution was introduced in the paper (link) and the paper “Multi-scale context aggregation by dilated convolutions” (link).
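This is the standard discrete convolution:
$$(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$$
The dilated convolution follows:
$$(F *_{l} k)(\mathbf{p}) = \sum_{\mathbf{s} + l\,\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})$$
When l = 1, the dilated convolution becomes the standard convolution.
Figure 19. The standard convolution (left). The dilated convolution (right).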
Intuitively, dilated convolutions “inflate” the kernel by inserting spaces between the kernel elements. This additional parameter l (dilation rate) indicates how much we want to widen the kernel. Implementations may vary, but there are usually l-1 spaces inserted between kernel elements. The following image shows the kernel size when l = 1, 2, and 4.
Figure 20. Receptive field for the dilated convolution. We essentially observe a large receptive field without adding additional costs.
In the image, the 3 x 3 red dots indicate that after the convolution, the output image has 3 x 3 pixels. Although all three dilated convolutions provide an output with the same dimension, the receptive field observed by the model is dramatically different. With the three layers applied one after another, as in the figure, the receptive field is 3 x 3 after the l = 1 layer, 7 x 7 after the l = 2 layer, and 15 x 15 after the l = 4 layer. (A single 3 x 3 kernel on its own spans 3 x 3, 5 x 5, and 9 x 9 input pixels for l = 1, 2, and 4.) Interestingly, the numbers of parameters associated with these operations are essentially identical. We “observe” a large receptive field without adding additional costs. Because of that, dilated convolution is used to cheaply increase the receptive field of output units without increasing the kernel size, which is especially effective when multiple dilated convolutions are stacked one after another.
The authors in the paper “Multi-scale context aggregation by dilated convolutions” build a network out of multiple layers of dilated convolutions, where the dilation rate l increases exponentially at each layer. As a result, the effective receptive field grows exponentially while the number of parameters grows only linearly with layers!
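The receptive-field growth can be checked with a little arithmetic. The sketch below assumes stride-1 layers with a k x k kernel; the function names are illustrative. A kernel with dilation rate l spans k + (k - 1)(l - 1) input pixels along each axis, and each stacked layer widens the receptive field by that span minus one.

```python
def effective_kernel_size(k, l):
    """Span of a k-wide kernel with l - 1 gaps inserted between neighbouring elements."""
    return k + (k - 1) * (l - 1)


def stacked_receptive_field(k, dilations):
    """Receptive field of one output unit after stacking stride-1 dilated layers."""
    rf = 1
    for l in dilations:
        rf += effective_kernel_size(k, l) - 1
    return rf


spans = [effective_kernel_size(3, l) for l in (1, 2, 4)]          # [3, 5, 9]
fields = [stacked_receptive_field(3, [1, 2, 4][:n]) for n in (1, 2, 3)]  # [3, 7, 15]
params_per_layer = 3 * 3                                          # unchanged by dilation
```

Doubling the dilation rate at every layer roughly doubles the receptive field each time, while every layer still holds only 3 x 3 weights per kernel.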
The dilated convolution in the paper is used to systematically aggregate multi-scale contextual information without losing resolution. The paper shows that the proposed module increases the accuracy of state-of-the-art semantic segmentation systems at that time (2016). Please check out the paper for more information.
Separable convolutions are used in some neural net architectures, such as MobileNet (Link). One can perform separable convolution spatially (spatially separable convolution) or depthwise (depthwise separable convolution).
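The spatially separable convolution operates on the 2D spatial dimensions of images, i.e. height and width. Conceptually, spatially separable convolution decomposes a convolution into two separate operations. In the example shown below, a Sobel kernel, which is a 3 x 3 kernel, is divided into a 3 x 1 and a 1 x 3 kernel.
Figure 21. A Sobel kernel can be divided into a 3 x 1 and a 1 x 3 kernel.
In standard convolution, the 3 x 3 kernel directly convolves with the image. In spatially separable convolution, the 3 x 1 kernel first convolves with the image. Then the 1 x 3 kernel is applied. This requires 6 instead of 9 parameters while doing the same operation.
Moreover, spatially separable convolution needs fewer multiplications than standard convolution. For a concrete example, convolution on a 5 x 5 image with a 3 x 3 kernel (stride = 1, padding = 0) requires scanning the kernel at 3 positions horizontally and 3 positions vertically. That is 9 positions in total, indicated as dots in the image below. At each position, 9 element-wise multiplications are applied. Overall, that is 9 x 9 = 81 multiplications.
Figure 22. Standard convolution with 1 channel.
On the other hand, for spatially separable convolution, we first apply a 3 x 1 filter on the 5 x 5 image. We scan such a kernel at 5 positions horizontally and 3 positions vertically. That is 5 x 3 = 15 positions in total, indicated as dots in the image below. At each position, 3 element-wise multiplications are applied. That is 15 x 3 = 45 multiplications. We now obtain a 3 x 5 matrix. This matrix is then convolved with a 1 x 3 kernel, which scans the matrix at 3 positions horizontally and 3 positions vertically. For each of these 9 positions, 3 element-wise multiplications are applied. This step requires 9 x 3 = 27 multiplications. Thus, overall, the spatially separable convolution takes 45 + 27 = 72 multiplications, which is fewer than the 81 of standard convolution.
Figure 23. Spatially separable convolution with 1 channel.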
Let’s generalize the above examples a little bit. Say we now apply convolutions on an N x N image with an m x m kernel, with stride = 1 and padding = 0. There are N - m + 1 sliding positions along each axis (N - 2 for the 3 x 3 kernel above). Traditional convolution requires (N-m+1) x (N-m+1) x m x m multiplications. Spatially separable convolution requires N x (N-m+1) x m + (N-m+1) x (N-m+1) x m = (2N-m+1) x (N-m+1) x m multiplications. The ratio of computation costs between spatially separable convolution and the standard convolution is
$$\frac{(2N - m + 1)(N - m + 1)\,m}{(N - m + 1)^2\, m^2} = \frac{2N - m + 1}{m\,(N - m + 1)} = \frac{2}{m} + \frac{m - 1}{m\,(N - m + 1)}$$
For layers where the image size N is much larger than the filter size m (N >> m), this ratio becomes 2 / m. This means that in this asymptotic situation (N >> m), the computational cost of spatially separable convolution is 2/3 of the standard convolution for a 3 x 3 filter. It is 2/5 for a 5 x 5 filter, 2/7 for a 7 x 7 filter, and so on.
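Both the equivalence and the multiplication counts can be verified with a short plain-Python sketch. The Sobel factorisation is the one described above; the function names and the 5 x 5 image values are assumptions for illustration.

```python
def conv2d(image, kernel):
    """Valid cross-correlation with a (possibly rectangular) kernel, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def mult_count(image_h, image_w, kh, kw):
    """Sliding positions times multiplications per position."""
    return (image_h - kh + 1) * (image_w - kw + 1) * kh * kw


sobel = [[-1, 0, 1],
         [-2, 0, 2],
         [-1, 0, 1]]
column = [[1], [2], [1]]     # 3 x 1 kernel
row = [[-1, 0, 1]]           # 1 x 3 kernel; column times row gives the Sobel kernel

image = [[3, 3, 2, 1, 0],
         [0, 0, 1, 3, 1],
         [3, 1, 2, 2, 3],
         [2, 0, 0, 2, 2],
         [2, 0, 0, 0, 1]]    # hypothetical 5 x 5 input

direct = conv2d(image, sobel)
separable = conv2d(conv2d(image, column), row)   # 3 x 1 pass, then 1 x 3 pass

standard_cost = mult_count(5, 5, 3, 3)                            # 81
separable_cost = mult_count(5, 5, 3, 1) + mult_count(3, 5, 1, 3)  # 45 + 27 = 72
```

`direct` and `separable` are identical because the Sobel kernel is exactly the product of the two 1D kernels; for a kernel that cannot be factorised this way, the two-pass version would only be an approximation.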
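Although spatially separable convolution saves cost, it is rarely used in deep learning. One of the main reasons is that not all kernels can be divided into two smaller kernels. If we replace all traditional convolutions with spatially separable convolutions, we restrict the search to separable kernels during training, and the training results may be sub-optimal.
Now, let’s move on to depthwise separable convolutions, which are much more commonly used in deep learning (e.g. in MobileNet and Xception). A depthwise separable convolution consists of two steps: a depthwise convolution and a 1 x 1 convolution.
Before describing these steps, it is worth revisiting the 2D convolution and 1 x 1 convolution discussed in the previous sections. Let’s have a quick recap of standard 2D convolution. For a concrete example, say the input layer is of size 7 x 7 x 3 (height x width x channels) and the filter is of size 3 x 3 x 3. After the 2D convolution with one filter, the output layer is of size 5 x 5 x 1 (with only 1 channel).
Figure 24. Standard 2D convolution to create an output with 1 layer, using 1 filter.
Typically, multiple filters are applied between two neural net layers. Say we have 128 filters here. After applying these 128 2D convolutions, we have 128 output maps of size 5 x 5 x 1. We then stack these maps into a single layer of size 5 x 5 x 128. By doing that, we transform the input layer (7 x 7 x 3) into the output layer (5 x 5 x 128). The spatial dimensions, i.e. height and width, shrink, while the depth is extended.
Figure 25. Standard 2D convolution to create an output with 128 layers, using 128 filters.
Now let’s see how we can achieve the same transformation with depthwise separable convolution.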
First, we apply depthwise convolution to the input layer. Instead of using a single filter of size 3 x 3 x 3 as in 2D convolution, we use 3 kernels separately. Each kernel has size 3 x 3 x 1 and convolves with 1 channel of the input layer (1 channel only, not all channels!). Each such convolution provides a map of size 5 x 5 x 1. We then stack these maps together to create a 5 x 5 x 3 image. At this point the spatial dimensions have shrunk, but the depth is still the same as before.
Figure 26. Depthwise separable convolution, first step: instead of using a single filter of size 3 x 3 x 3 as in 2D convolution, we use 3 kernels of size 3 x 3 x 1 separately. Each kernel convolves with 1 channel of the input layer, and the resulting 5 x 5 x 1 maps are stacked into a 5 x 5 x 3 output.
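As the second step of depthwise separable convolution, to extend the depth, we apply 1 x 1 convolutions with kernel size 1 x 1 x 3. Convolving the 5 x 5 x 3 input image with each 1 x 1 x 3 kernel provides a map of size 5 x 5 x 1.
Figure 27. Depthwise separable convolution, second step.
Thus, after applying 128 1 x 1 convolutions, we have a layer of size 5 x 5 x 128.
Figure 28. Depthwise separable convolution, second step: apply multiple 1 x 1 convolutions to modify the depth.
With these two steps, depthwise separable convolution also transforms the input layer (7 x 7 x 3) into the output layer (5 x 5 x 128). The overall process of depthwise separable convolution is shown in the figure below:
Figure 29. The overall process of depthwise separable convolution.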
So, what is the advantage of doing depthwise separable convolutions? Efficiency! One needs far fewer operations for depthwise separable convolutions than for 2D convolutions.
Let’s recall the computation costs for our example of 2D convolutions. There are 128 3 x 3 x 3 kernels that each move 5 x 5 times. That is 128 x 3 x 3 x 3 x 5 x 5 = 86,400 multiplications.
How about the separable convolution? In the first depthwise convolution step, there are 3 3 x 3 x 1 kernels that each move 5 x 5 times. That is 3 x 3 x 3 x 1 x 5 x 5 = 675 multiplications. In the second step of 1 x 1 convolution, there are 128 1 x 1 x 3 kernels that each move 5 x 5 times. That is 128 x 1 x 1 x 3 x 5 x 5 = 9,600 multiplications. Thus, overall, the depthwise separable convolution takes 675 + 9,600 = 10,275 multiplications. This is only about 12% of the cost of the 2D convolution!
So, for an image of arbitrary size, how much computation can we save if we apply depthwise separable convolution? Let’s generalize the above examples a little bit. For an input image of size H x W x D, we want to do 2D convolution (stride = 1, padding = 0) with Nc kernels of size h x h x D. This transforms the input layer (H x W x D) into the output layer ((H-h+1) x (W-h+1) x Nc). The overall number of multiplications needed is Nc x h x h x D x (H-h+1) x (W-h+1).
On the other hand, for the same transformation, the number of multiplications needed for depthwise separable convolution is
D x h x h x 1 x (H-h+1) x (W-h+1) + Nc x 1 x 1 x D x (H-h+1) x (W-h+1) = (h x h + Nc) x D x (H-h+1) x (W-h+1)
The ratio of multiplications between depthwise separable convolution and 2D convolution is now:
$$\frac{(h^2 + N_c)\, D\, (H-h+1)(W-h+1)}{N_c\, h^2\, D\, (H-h+1)(W-h+1)} = \frac{1}{N_c} + \frac{1}{h^2}$$
For most modern architectures, it is common for the output layer to have many channels, e.g. several hundred if not several thousand. For such layers (Nc >> h x h), the above expression reduces to 1 / h / h. In this asymptotic regime, if 3 x 3 filters are used, 2D convolutions need about 9 times as many multiplications as depthwise separable convolutions. For 5 x 5 filters, 2D convolutions need about 25 times as many.
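The two cost formulas can be written out as small helpers to reproduce the numbers of the 7 x 7 x 3 example. This is a sketch with assumed function names; stride 1 and no padding are assumed, as in the text.

```python
def standard_conv_mults(H, W, D, h, Nc):
    """Nc filters of size h x h x D, each applied at (H-h+1) x (W-h+1) positions."""
    return Nc * h * h * D * (H - h + 1) * (W - h + 1)


def depthwise_separable_mults(H, W, D, h, Nc):
    positions = (H - h + 1) * (W - h + 1)
    depthwise = D * h * h * positions      # D kernels of size h x h x 1
    pointwise = Nc * D * positions         # Nc kernels of size 1 x 1 x D
    return depthwise + pointwise


standard = standard_conv_mults(7, 7, 3, 3, 128)          # 86,400
separable = depthwise_separable_mults(7, 7, 3, 3, 128)   # 675 + 9,600 = 10,275
ratio = separable / standard                             # about 0.119
predicted = 1 / 128 + 1 / 3 ** 2                         # 1/Nc + 1/h^2
```

`ratio` and `predicted` agree, and both approach 1/9 as the number of output channels grows.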
Are there any drawbacks to using depthwise separable convolutions? Sure, there are. Depthwise separable convolutions reduce the number of parameters in the convolution. As such, for a small model, the model capacity may decrease significantly if the 2D convolutions are replaced by depthwise separable convolutions. As a result, the model may become sub-optimal. However, if properly used, depthwise separable convolutions can give you the efficiency without dramatically damaging your model performance.
The flattened convolution was introduced in the paper “Flattened convolutional neural networks for feedforward acceleration”. Intuitively, the idea is to apply filter separation. Instead of applying one standard convolution filter to map the input layer to an output layer, we separate this standard filter into three 1D filters. The idea is similar to that of the spatially separable convolution described above, where a spatial filter is approximated by two rank-1 filters.
One should notice that if the standard convolution filter is a rank-1 filter, it can always be separated into cross-products of three 1D filters. But this is a strong condition, and the intrinsic rank of a standard filter is higher than one in practice. As pointed out in the paper: “As the difficulty of classification problem increases, the more number of leading components is required to solve the problem… Learned filters in deep networks have distributed eigenvalues and applying the separation directly to the filters results in significant information loss.”
To alleviate this problem, the paper restricts connections in receptive fields so that the model can learn 1D separated filters during training. The paper claims that flattened networks, which consist of consecutive sequences of 1D filters across all directions in 3D space, provide performance comparable to standard convolutional networks at much lower computation cost, owing to the significant reduction in learned parameters.
Grouped convolution was introduced in the AlexNet paper (link) in 2012. The main reason for implementing it was to allow the network to be trained over two GPUs with limited memory (3 GB of memory per GPU). The AlexNet below shows two separate convolution paths at most of the layers. It is doing model parallelization across two GPUs (of course, one can parallelize over more GPUs if they are available).
Here we describe how grouped convolutions work. First of all, conventional 2D convolutions follow the steps shown below. In this example, the input layer of size (7 x 7 x 3) is transformed into the output layer of size (5 x 5 x 128) by applying 128 filters (each filter is of size 3 x 3 x 3). Or, in the general case, the input layer of size (Hin x Win x Din) is transformed into the output layer of size (Hout x Wout x Dout) by applying Dout filters (each of size h x w x Din).
Figure 30. Standard 2D convolution.
In grouped convolution, the filters are separated into different groups. Each group is responsible for a conventional 2D convolution over a certain depth. The following example makes this clearer.
Figure 31. Grouped convolution with 2 filter groups.
Above is an illustration of grouped convolution with 2 filter groups. In each filter group, the depth of each filter is only half of that in the standard 2D convolution: the filters have depth Din / 2. Each filter group contains Dout / 2 filters. The first filter group (red) convolves with the first half of the input layer ([:, :, 0:Din/2]), while the second filter group (blue) convolves with the second half of the input layer ([:, :, Din/2:Din]). As a result, each filter group creates Dout / 2 channels. Overall, the two groups create 2 x Dout / 2 = Dout channels. We then stack these channels to form the output layer with Dout channels.
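The slicing and stacking described above can be sketched in NumPy as follows. This is a minimal illustration under assumed names (`conv2d`, `grouped_conv2d`) and an assumed channels-last layout, with small made-up sizes (Din = 4, Dout = 6), not an efficient implementation.

```python
import numpy as np

def conv2d(x, kernels):
    """Naive stride-1, no-padding convolution.
    x: (H, W, D_in); kernels: (D_out, h, w, D_in)."""
    D_out, h, w, _ = kernels.shape
    H_out, W_out = x.shape[0] - h + 1, x.shape[1] - w + 1
    out = np.zeros((H_out, W_out, D_out))
    for i in range(H_out):
        for j in range(W_out):
            out[i, j] = np.tensordot(kernels, x[i:i + h, j:j + w, :], axes=3)
    return out

def grouped_conv2d(x, kernel_groups):
    """Each filter group convolves only with its own slice of input channels."""
    groups = len(kernel_groups)
    d_in_per_group = x.shape[2] // groups
    outputs = []
    for g, kernels in enumerate(kernel_groups):
        x_slice = x[:, :, g * d_in_per_group:(g + 1) * d_in_per_group]
        outputs.append(conv2d(x_slice, kernels))
    # Stack the channels produced by all groups along the depth axis.
    return np.concatenate(outputs, axis=2)

rng = np.random.default_rng(0)
D_in, D_out, groups = 4, 6, 2
x = rng.standard_normal((7, 7, D_in))
# Each group holds D_out / 2 filters, each of depth D_in / 2.
kernel_groups = [
    rng.standard_normal((D_out // groups, 3, 3, D_in // groups))
    for _ in range(groups)
]
y = grouped_conv2d(x, kernel_groups)
print(y.shape)  # (5, 5, 6)
```

Because the first group never reads the second half of the input channels, changing those channels leaves the first Dout / 2 output channels unchanged.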
You may already have noticed some links and differences between grouped convolution and the depthwise convolution used in depthwise separable convolution. If the number of filter groups equals the number of input channels, each filter has depth Din / Din = 1. This is the same filter depth as in depthwise convolution.
On the other hand, each filter group now contains Dout / Din filters, and overall the output layer has depth Dout. This differs from depthwise convolution, which does not change the layer depth. In depthwise separable convolution, the layer depth is extended afterwards by a 1 x 1 convolution.
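A few lines of Python make this bookkeeping explicit. The helper `grouped_conv_layout` is a hypothetical name introduced only for this illustration, and the channel counts are made up.

```python
def grouped_conv_layout(d_in, d_out, groups):
    """Return (filter depth, filters per group) for a grouped convolution."""
    assert d_in % groups == 0 and d_out % groups == 0
    return d_in // groups, d_out // groups

d_in = 32

# Depthwise convolution: one group per input channel, layer depth unchanged.
print(grouped_conv_layout(d_in, d_out=d_in, groups=d_in))      # (1, 1)

# Grouped convolution with as many groups as input channels but Dout = 4 * Din:
# same filter depth of 1, but Dout / Din = 4 filters per group.
print(grouped_conv_layout(d_in, d_out=4 * d_in, groups=d_in))  # (1, 4)

# Standard convolution: a single group spanning the full input depth.
print(grouped_conv_layout(d_in, d_out=4 * d_in, groups=1))     # (32, 128)
```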
There are a few advantages of doing grouped convolution.
The first advantage is efficient training. Since the convolutions are divided into several paths, each path can be handled separately by a different GPU. This allows the model to be trained over multiple GPUs in parallel. Compared with training everything on one GPU, such model parallelization over multiple GPUs allows more images to be fed into the network per step. Model parallelization is sometimes considered preferable to data parallelization. The latter splits the dataset into batches, and we then train on each batch. However, when the batch size becomes too small, we are essentially doing stochastic rather than batch gradient descent, which can result in slower and sometimes poorer convergence.
Grouped convolutions become important for training very deep neural nets, as in the ResNeXt architecture shown below:
Figure 32. The image is taken from the ResNeXt paper.
The second advantage is that the model is more efficient: the number of model parameters decreases as the number of filter groups increases. In the previous examples, the filters have h x w x Din x Dout parameters in a standard 2D convolution. The filters in a grouped convolution with 2 filter groups have (h x w x Din/2 x Dout/2) x 2 parameters, so the number of parameters is reduced by half.
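The same counting generalizes to any number of groups G: each group has h x w x (Din/G) x (Dout/G) weights, so the total shrinks by a factor of G. The short sketch below uses a hypothetical helper `conv_params` and made-up layer sizes, and ignores biases.

```python
def conv_params(h, w, d_in, d_out, groups=1):
    """Weight count (no biases) of a grouped convolution."""
    assert d_in % groups == 0 and d_out % groups == 0
    per_group = h * w * (d_in // groups) * (d_out // groups)
    return per_group * groups

# 3 x 3 filters, 64 input channels, 128 output channels.
for g in (1, 2, 4, 8):
    print(g, conv_params(3, 3, 64, 128, groups=g))
# 1 73728
# 2 36864
# 4 18432
# 8 9216
```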
The third advantage is a bit surprising: grouped convolution may yield a better model than a standard 2D convolution. Another excellent blog post explains why. Here is a brief summary.
The reason lies in the sparse relationships between filters. The image below shows the correlation across filters of adjacent layers. The relationship is sparse.
Figure 33. The correlation matrix between filters of adjacent layers in a Network-in-Network model trained on CIFAR10. Pairs of highly correlated filters are brighter, while less correlated filters are darker. The image is taken from this article.
What about the correlation map for grouped convolution?
Figure 34. The correlations between filters of adjacent layers in a Network-in-Network model trained on CIFAR10, when trained with 1, 2, 4, 8 and 16 filter groups. The image is taken from this article.
The image above shows the correlation across filters of adjacent layers when the model is trained with 1, 2, 4, 8, and 16 filter groups. The article offers one explanation: “The effect of filter groups is to learn with a block-diagonal structured sparsity on the channel dimension… the filters with high correlation are learned in a more structured way in the networks with filter groups. In effect, filter relationships that don’t have to be learned are no longer parameterized. In reducing the number of parameters in the network in this salient way, it is not as easy to over-fit, and hence a regularization-like effect allows the optimizer to learn more accurate, more efficient deep networks.”
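The block-diagonal structure mentioned in the quote can be seen directly in a small NumPy sketch. For simplicity it assumes a 1 x 1 grouped convolution acting on the channel vector of a single pixel; the sizes and variable names are made up for illustration. The grouped layer gives the same result as a dense layer whose weight matrix is zero outside the diagonal blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, groups = 8, 4, 2
in_g, out_g = d_in // groups, d_out // groups

# One weight matrix per filter group for a 1 x 1 convolution: (out_g, in_g).
group_weights = [rng.standard_normal((out_g, in_g)) for _ in range(groups)]

# Equivalent dense weight matrix: zero everywhere except the diagonal blocks.
dense = np.zeros((d_out, d_in))
for g, w_g in enumerate(group_weights):
    dense[g * out_g:(g + 1) * out_g, g * in_g:(g + 1) * in_g] = w_g

x = rng.standard_normal(d_in)  # channel vector at one pixel

# Grouped computation: each group multiplies only its own slice of channels.
grouped_out = np.concatenate(
    [w_g @ x[g * in_g:(g + 1) * in_g] for g, w_g in enumerate(group_weights)]
)

print(np.allclose(grouped_out, dense @ x))  # True
print((dense != 0).astype(int))             # block-diagonal pattern of ones
```

The off-diagonal blocks are the filter relationships that are no longer parameterized.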
In addition, each filter group learns a unique representation of the data. As noted by the authors of AlexNet, filter groups appear to structure the learned filters into two distinct groups: black-and-white filters and color filters.
Figure 35. AlexNet conv1 filter separation: as noted by the authors, filter groups appear to structure learned filters into two distinct groups, black-and-white and color filters. The image is taken from the AlexNet paper.
Shuffled grouped convolution was introduced in ShuffleNet from Megvii Inc (Face++). ShuffleNet is a computation-efficient convolution architecture designed specifically for mobile devices with very limited computing power (e.g. 10–150 MFLOPs).
The ideas behind shuffled grouped convolution are linked to the ideas behind grouped convolution (used in MobileNet and ResNeXt, for example) and depthwise separable convolution (used in Xception).
Overall, shuffled grouped convolution involves grouped convolution and channel shuffling.
In the section on grouped convolution, we saw that the filters are separated into different groups. Each group is responsible for a conventional 2D convolution over a certain depth, and the total number of operations is significantly reduced. For example, in the figure below, we have 3 filter groups. The first filter group convolves with the red portion of the input layer. Similarly, the second and third filter groups convolve with the green and blue portions of the input. The kernel depth in each filter group is only 1/3 of the total channel count of the input layer. In this example, after the first grouped convolution GConv1, the input layer is mapped to an intermediate feature map. This feature map is then mapped to the output layer through the second grouped convolution GConv2.
Figure 36.
Grouped convolution is computationally efficient. The problem is that each filter group only handles information passed down from a fixed portion of the previous layers. For example, in the image above, the first filter group (red) only processes information passed down from the first 1/3 of the input channels. The third filter group (blue) only processes information passed down from the last 1/3 of the input channels. As such, each filter group is limited to learning a few specific features. This property blocks information flow between channel groups and weakens the representations learned during training. To overcome this problem, we apply the channel shuffle.
The idea of channel shuffle is to mix up the information from different filter groups. In the image below, we get the feature map after applying the first grouped convolution GConv1 with 3 filter groups. Before feeding this feature map into the second grouped convolution, we first divide the channels in each group into several subgroups. Then we mix up these subgroups.
Figure 37. Channel shuffle.
After this shuffling, we perform the second grouped convolution GConv2 as usual. But now, since the information in the shuffled layer has already been mixed, each group in GConv2 is fed with different subgroups of the feature map layer (or of the input layer). As a result, information can flow between channel groups, which strengthens the representations.
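One common way to implement the shuffle, following the reshape, transpose, and flatten recipe described in the ShuffleNet paper, is sketched below in NumPy. The function name `channel_shuffle` and the channels-last layout are assumptions made for this illustration.

```python
import numpy as np

def channel_shuffle(x, groups):
    """Shuffle channels of x: (H, W, C), where channels are ordered group by group."""
    H, W, C = x.shape
    assert C % groups == 0
    x = x.reshape(H, W, groups, C // groups)  # split into (group, index within group)
    x = x.transpose(0, 1, 3, 2)               # swap the two channel axes
    return x.reshape(H, W, C)                 # flatten back to C channels

# 3 groups of 3 channels; each channel value is its original channel index.
x = np.arange(9).reshape(1, 1, 9)
shuffled = channel_shuffle(x, groups=3)
print(shuffled[0, 0])  # [0 3 6 1 4 7 2 5 8]
```

After the shuffle, every block of 3 consecutive channels contains one channel from each original group, so each filter group in GConv2 sees information from all groups of GConv1.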
The ShuffleNet paper also introduced pointwise grouped convolution. Typically, for grouped convolutions such as those in MobileNet or ResNeXt, the group operation is performed on the 3 x 3 spatial convolution, but not on the 1 x 1 convolution.
The ShuffleNet paper argues that 1 x 1 convolutions are also computationally costly, and it suggests applying grouped convolution to the 1 x 1 convolution as well. Pointwise grouped convolution, as the name suggests, performs the group operation on a 1 x 1 convolution. The operation is identical to grouped convolution, with only one modification: it is performed on 1 x 1 filters instead of N x N filters (N > 1).
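Since a 1 x 1 filter has no spatial extent, a pointwise grouped convolution reduces to one small matrix multiplication per group at every pixel. The NumPy sketch below illustrates this under assumed names and made-up sizes. It also counts multiply-accumulate operations to show the saving over a dense 1 x 1 convolution.

```python
import numpy as np

def pointwise_grouped_conv(x, group_weights):
    """1 x 1 grouped convolution.

    x:             (H, W, D_in)
    group_weights: list of G matrices, each of shape (D_in / G, D_out / G)
    returns:       (H, W, D_out)
    """
    groups = len(group_weights)
    d = x.shape[2] // groups
    # Each group maps its own slice of input channels to its own output channels.
    outs = [x[:, :, g * d:(g + 1) * d] @ w_g for g, w_g in enumerate(group_weights)]
    return np.concatenate(outs, axis=2)

rng = np.random.default_rng(0)
H, W, d_in, d_out, groups = 4, 4, 6, 12, 3
x = rng.standard_normal((H, W, d_in))
group_weights = [
    rng.standard_normal((d_in // groups, d_out // groups)) for _ in range(groups)
]
y = pointwise_grouped_conv(x, group_weights)
print(y.shape)  # (4, 4, 12)

# Multiply-accumulate counts for the 1 x 1 layer, dense versus grouped.
macs_dense = H * W * d_in * d_out
macs_grouped = H * W * (d_in // groups) * (d_out // groups) * groups
print(macs_dense // macs_grouped)  # 3
```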
In the ShuffleNet paper, the authors used three types of convolution that we have covered: (1) shuffled grouped convolution; (2) pointwise grouped convolution; and (3) depthwise separable convolution. This architecture design significantly reduces the computation cost while maintaining accuracy. For example, the classification errors of ShuffleNet and AlexNet are comparable on actual mobile devices, yet the computation cost drops dramatically from 720 MFLOPs in AlexNet to 40–140 MFLOPs in ShuffleNet. With its relatively small computation cost and good model performance, ShuffleNet has gained popularity among convolutional neural networks for mobile devices.
Thank you for reading the article. Please feel free to leave questions and comments below.
【Reference】
https://towardsdatascience.com/types-of-convolutions-in-deep-learning-717013397f4d