Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

Published: 2014

Venue: ECCV

Full title: Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Paper: https://arxiv.org/pdf/1406.4729.pdf

Code: *http://research.microsoft.com/en-us/um/people/kahe/*

Significance: introduced spatial pyramid pooling for deep convolutional networks

1. Abstract

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g., 224×224) input image. This requirement is “artificial” and may reduce the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with another pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. Pyramid pooling is also robust to object deformations. With these advantages, SPP-net should in general improve all CNN-based image classification methods. On the ImageNet 2012 dataset, we demonstrate that SPP-net boosts the accuracy of a variety of CNN architectures despite their different designs. On the Pascal VOC 2007 and Caltech101 datasets, SPP-net achieves state-of-the-art classification results using a single full-image representation and no fine-tuning.

The power of SPP-net is also significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method is 24-102× faster than the R-CNN method, while achieving better or comparable accuracy on Pascal VOC 2007.

In ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014, our methods rank #2 in object detection and #3 in image classification among all 38 teams. This manuscript also introduces the improvement made for this competition.

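Existing convolutional networks require a fixed-size input image (e.g., 224×224). This requirement is artificial, and it can reduce recognition accuracy for images of arbitrary size or scale.
The authors propose spatial pyramid pooling (SPP) to remove this requirement. The pooling strategy is also robust to object deformations.
A CNN equipped with a spatial pyramid pooling layer is called SPP-net.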
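
To make the "fixed-length representation regardless of image size" claim concrete, here is a minimal NumPy sketch of a spatial pyramid pooling step. It is an illustration, not the authors' implementation: the function name `spatial_pyramid_pool`, the pyramid levels `(1, 2, 4)`, the floor/ceil bin boundaries, and the 256-channel feature maps are all assumptions made for this example.

```python
import math

import numpy as np


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a vector of length C * sum(n*n)."""
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:  # each level splits the map into an n x n grid of bins
        for i in range(n):
            # Row range of bin i; floor/ceil boundaries keep every bin non-empty.
            r0 = (i * height) // n
            r1 = math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = (j * width) // n
                c1 = math.ceil((j + 1) * width / n)
                # One max value per channel for this bin.
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
# Two feature maps with different spatial sizes (hypothetical shapes).
small = spatial_pyramid_pool(rng.standard_normal((256, 13, 13)))
large = spatial_pyramid_pool(rng.standard_normal((256, 10, 17)))
print(small.shape, large.shape)  # (5376,) (5376,)
```

Both outputs have length 256 × (1 + 4 + 16) = 5376, so a fully connected layer placed after this step sees the same input size no matter how large the feature map was.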

2. Background

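Typical CNNs require a fixed input image size, which has the following drawbacks:

It restricts both the aspect ratio and the scale of the input image.
Cropping the input image to the required size: the cropped region may not contain the entire object instance (figure below, left).
Warping (rescaling) the input image to the required size: this introduces unwanted geometric distortion (figure below, right).
In both cases recognition accuracy drops, because part of the object is lost or the image is distorted.

In short, a fixed input size ignores the question of object scale.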
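
The cost of cropping and warping can be put in numbers with a small sketch. The helper names and the 500×375 example image below are hypothetical, and the crop is assumed to be a plain center crop with no resizing beforehand.

```python
def center_crop_coverage(width, height, target=224):
    """Fraction of the original pixels kept by a target x target center crop."""
    kept_w = min(width, target)
    kept_h = min(height, target)
    return (kept_w * kept_h) / (width * height)


def warp_distortion(width, height, target=224):
    """Horizontal scale factor divided by vertical scale factor when warping
    to target x target; 1.0 means the aspect ratio is preserved."""
    return (target / width) / (target / height)


# Hypothetical 500x375 input image.
print(f"{center_crop_coverage(500, 375):.3f}")  # 0.268
print(f"{warp_distortion(500, 375):.3f}")       # 0.750
```

Under these assumptions the crop keeps only about 27% of the pixels, while the warp squeezes the image horizontally to 75% of its vertical scale.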

[Figure: fitting an image to a fixed input size by cropping (left) or warping (right)]