Authors: Ross B. Girshick (widely known as RBG), Jeff Donahue, Trevor Darrell, Jitendra Malik

Published: 2014

**Venue:** CVPR

Paper title: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

Paper URL: https://openaccess.thecvf.com/content_cvpr_2014/html/Girshick_Rich_Feature_Hierarchies_2014_CVPR_paper.html

MATLAB code: https://github.com/rbgirshick/rcnn

Slides: https://dl.dropboxusercontent.com/s/bpi3vd7gia9f6ul/rcnn-cvpr14-slides.pdf?dl=0

Poster: https://dl.dropboxusercontent.com/s/tzefwijlstpapl1/rcnn-poster.pdf?dl=0

Supplementary material: https://dl.dropboxusercontent.com/s/1yisyl5cuxo7g9y/r-cnn-cvpr-supp.pdf?dl=0

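Significance: R-CNN is widely regarded as the founding work of two-stage deep learning object detectors. It was among the first methods to apply deep learning and convolutional neural networks to object detection, and it delivered a large performance improvement.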

1. Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: *Regions with CNN features*. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn
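
A quick sanity check on the "more than 30% relative" figure, using only the two numbers quoted in the abstract. The symbol $m_{\text{prev}}$ is introduced here for the previous best mAP on VOC 2012; the abstract does not state its value.

$$
\frac{53.3 - m_{\text{prev}}}{m_{\text{prev}}} > 0.30
\quad\Longleftrightarrow\quad
m_{\text{prev}} < \frac{53.3}{1.30} = 41.0
$$

So the claim implies that the earlier best result was below 41.0% mAP, which means an absolute gain of more than 12.3 points.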

**Background:** Object detection performance on the canonical PASCAL VOC benchmark had plateaued.
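
Contributions of this paper:

1. Apply a CNN to bottom-up region proposals to extract features, which are then used for localization, classification, or segmentation.
2. Transfer learning: first pre-train the model on an auxiliary task (ImageNet classification), then fine-tune it on the target task (VOC detection), which yields a large performance gain.

R-CNN: Regions with CNN features (region proposals combined with a CNN)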
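
The transfer-learning recipe in contribution 2 can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' code: the layer names `fc7` and `fc8`, the toy layer sizes, the helper `prepare_for_finetuning`, and the 0.01 initialisation scale are all assumptions made for the example. The idea it shows is that every pre-trained layer is kept, while the 1000-way ImageNet classifier is swapped for a randomly initialised head with one output per VOC class plus one for background.

```python
import numpy as np

def prepare_for_finetuning(pretrained, num_classes, rng, head="fc8"):
    """Copy every pre-trained layer except the classifier head, then attach a
    randomly initialised (num_classes + 1)-way head (the +1 is background)."""
    weights = {name: w.copy() for name, w in pretrained.items() if name != head}
    feature_dim = pretrained[head].shape[1]  # input width of the old head
    # Assumed initialisation: small zero-mean Gaussian noise.
    weights[head] = rng.normal(0.0, 0.01, size=(num_classes + 1, feature_dim))
    return weights

rng = np.random.default_rng(0)
# Stand-in "pre-trained" weights with toy sizes, stored as (outputs, inputs).
# Real weights would come from ImageNet training.
pretrained = {
    "fc7": rng.normal(size=(8, 16)),
    "fc8": rng.normal(size=(1000, 8)),  # 1000-way ImageNet classifier
}
finetune = prepare_for_finetuning(pretrained, num_classes=20, rng=rng)
print(finetune["fc8"].shape)  # (21, 8): 20 VOC classes + background
```

Fine-tuning would then continue training all of these weights on the detection data, typically with a smaller learning rate than in pre-training so that the transferred features are not destroyed.
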
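
Problems to solve:

1. How to localize objects with a deep network: R-CNN extracts about 2000 category-independent region proposals, warps each one to a fixed size, and feeds it to the CNN to obtain a fixed-length feature vector.
2. How to train a high-capacity model with only a small amount of annotated data: R-CNN first performs supervised pre-training on a large image classification dataset (ImageNet), then fine-tunes the pre-trained model on the domain-specific dataset (in other words, transfer learning).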
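
The "warp each proposal to a fixed size" step in item 1 can be sketched as below. This is an illustrative NumPy version under stated assumptions: the helper name `warp_proposal` is hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` pixel coordinates inside the image, the output size of 227 is assumed to match an AlexNet-style input, and nearest-neighbour sampling stands in for whatever interpolation a real pipeline would use. It also omits any context padding around the box.

```python
import numpy as np

def warp_proposal(image, box, out_size=227):
    """Crop box = (x1, y1, x2, y2) and warp it to out_size x out_size,
    ignoring the original aspect ratio (anisotropic scaling)."""
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    h, w = crop.shape[:2]
    # Nearest-neighbour sampling: map each output pixel back to a source pixel.
    rows = np.arange(out_size) * h // out_size
    cols = np.arange(out_size) * w // out_size
    return crop[rows[:, None], cols[None, :]]

# Two proposals of different shapes become one fixed-size batch for the CNN.
image = np.zeros((375, 500, 3), dtype=np.uint8)
boxes = [(10, 20, 110, 220), (200, 50, 480, 300)]
batch = np.stack([warp_proposal(image, b) for b in boxes])
print(batch.shape)  # (2, 227, 227, 3)
```

Because every proposal ends up with the same shape, the CNN always sees the same input size and therefore produces a feature vector of the same length for every region.
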
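
Existing approaches at the time:

1. Treat object detection as a regression problem. According to the authors, this did not work well in practice (YOLO had not yet been published).
2. Use a sliding-window detector. This had typically been applied to constrained categories such as faces and pedestrians, and it needs high spatial resolution (so the network cannot be very deep).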
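
**Why sliding windows do not suit R-CNN:** the R-CNN network has more convolutional and pooling layers, so each unit in its final feature map has a very large receptive field and stride in the input image (the paper reports 195 × 195 pixels and 32 × 32 pixels for its five-convolutional-layer network). This conflicts with the high spatial resolution that the sliding-window approach relies on, which makes precise localization difficult.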
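
To see how receptive field and stride grow with depth, here is a small sketch. The layer list is an assumption: it uses the standard AlexNet-style kernel sizes and strides from conv1 through pool5, which the text above does not spell out, and the function name `receptive_field` is chosen for this example. Under that assumption the standard recurrence gives a 195-pixel receptive field and a 32-pixel stride.

```python
def receptive_field(layers):
    """Return (receptive_field, stride) in input pixels after a stack of
    (kernel, stride) layers, ignoring padding."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input-space steps
        jump *= stride             # input-pixel distance between adjacent output units
    return rf, jump

# Assumed AlexNet-style stack of (kernel, stride) pairs, conv1 through pool5.
ALEXNET_LAYERS = [
    (11, 4), (3, 2),          # conv1, pool1
    (5, 1), (3, 2),           # conv2, pool2
    (3, 1), (3, 1), (3, 1),   # conv3, conv4, conv5
    (3, 2),                   # pool5
]

print(receptive_field(ALEXNET_LAYERS))  # (195, 32)
```

A window classifier whose output units sit 32 pixels apart and each see a 195-pixel patch cannot place a box precisely, which is why the method classifies region proposals instead of sliding a window.
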
