Title
On the purity of training and testing data for learning: The case of pedestrian detection.
Abstract
The training and evaluation of learning algorithms depend critically on the quality of data samples. We denote as pure the samples that identify clearly and without any ambiguity the class of objects of interest. For instance, in pedestrian detection algorithms, we consider as pure samples the ones containing persons who are fully visible and are imaged at a good resolution (larger than the detector window in size). The exclusive use of pure samples entails two kinds of problems. In training, it biases the detector to neglect slightly occluded and small-sized samples (which we denote as impure), thus reducing its detection rate in a real-world application. In testing, it leads to the unfair evaluation and comparison of different detectors, since slightly impure samples, when detected, can be counted as false positives. In this paper we study how a sensible use of impure samples can benefit both the training and the evaluation of pedestrian detection algorithms. We improve the labelling of one of the most widely used pedestrian data sets (INRIA), taking into account the degree of sample impurity. We observe that including partially occluded pedestrians in the training improves performance, not only on partially visible examples, but also on the fully visible ones. Furthermore, we find that including pedestrians imaged at low resolutions is beneficial for detecting pedestrians in the same range of heights, leaving the performance on pure samples unchanged. However, including samples with too high a grade of impurity degrades the performance, so a careful balance must be found. The proposed labelling will allow further studies on the role of impure samples in training pedestrian detectors and on devising fairer comparison metrics between different algorithms.
Year: 2015
DOI: 10.1016/j.neucom.2014.09.055
Venue: Neurocomputing
Keywords: Sample purity,Pedestrian detection,Machine learning,Partial occlusion,INRIA person data set,Labelling
Field: Pedestrian,Data set,Pattern recognition,Test data,Artificial intelligence,Ambiguity,Pedestrian detection,Detector,Mathematics,False positive paradox
DocType: Journal
Volume: 150
ISSN: 0925-2312
Citations: 3
PageRank: 0.43
References: 30
Authors: 3
Name,Order,Citations,PageRank
Matteo Taiana,1,39,3.68
Jacinto C. Nascimento,2,396,40.94
Alexandre Bernardino,3,710,78.77