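Semi-Supervised Object Detection Database - Laboratory B704, Department of Measurement and Control, School of Instrumentation and Optoelectronics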

Object detection has become an active research topic in computer vision, and instance-level object detection in particular has important applications in intelligent surveillance, visual navigation, human-computer interaction, and intelligent services. Inspired by the success of deep learning in artificial intelligence, researchers have tried to apply it to instance-level object detection to improve detection performance. However, a deep convolutional neural network (DCNN) usually needs a large annotated dataset to supervise its training. Manually annotating a dataset of that size is time-consuming and laborious, which has slowed the adoption of DCNNs in object detection. How to train a high-performing DCNN for object detection from a small number of manually labeled samples and a large number of unlabeled samples has therefore become a key difficulty in this line of research.

Our laboratory has implemented an object detection algorithm based on semi-supervised learning, using co-training. In the course of this work we built our own semi-supervised instance-level object detection database, the BHID-TOOL Dataset. It contains RGB images of 4 instance objects that differ markedly in appearance, captured from multiple viewpoints against complex backgrounds. The four objects are common tool-type items: a car model, a helicopter model, a roll of tape, and a pair of pliers. The BHID-TOOL Dataset contains 13 video sequences of natural scenes, of which 9 sequences (6302 frames) are used for training and 4 sequences (2400 frames) are used for testing. The camera records color images at a resolution of 352×288. The video sequences were recorded by holding the camera at approximately human eye level while walking around in each scene. The objects were filmed at varying distances and under varying illumination, viewpoints, and background conditions. Of the 6302 training images, 315 key samples were manually annotated. All the data in the BHID-TOOL Dataset is compressed in the attachment below.
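For readers who want to check the split sizes before downloading, the figures above can be collected in a small Python sketch. This is only an illustration: the `Split` class, the constant names, and `labeled_fraction` are hypothetical and are not part of the dataset or of any released tooling, and the attachment's actual file layout is not assumed here. The number of annotated test frames is not stated above, so it is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Split:
    """Summary of one dataset split (hypothetical helper, not shipped with the data)."""
    sequences: int
    frames: int
    annotated: Optional[int] = None  # None where the description gives no figure


# Figures taken from the dataset description above.
TRAIN = Split(sequences=9, frames=6302, annotated=315)
TEST = Split(sequences=4, frames=2400)  # annotated count not stated
CLASSES = ("car model", "helicopter model", "tape", "pliers")
RESOLUTION = (352, 288)  # width x height in pixels


def labeled_fraction(split: Split) -> float:
    """Return the share of frames in a split that carry manual annotations."""
    if split.annotated is None:
        raise ValueError("annotated frame count is not stated for this split")
    return split.annotated / split.frames


total_sequences = TRAIN.sequences + TEST.sequences
total_frames = TRAIN.frames + TEST.frames

if __name__ == "__main__":
    print(f"sequences: {total_sequences}, frames: {total_frames}")
    print(f"labeled share of training frames: {labeled_fraction(TRAIN):.1%}")
```

The labeled share works out to roughly one training frame in twenty, which is the small-labeled, large-unlabeled regime that the semi-supervised setting targets.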