In this project we implement a face detection program using SIFT-like Histogram of Oriented Gradients (HoG) features, based on the paper by Dalal and Triggs.
We also experiment with several image preprocessing techniques and extra positive training sets to improve performance. Details are explained below.
The program consists of the following main steps.
We extracted 6713 positive features (faces) from the Caltech Web Faces dataset and 50000 random negative features (non-faces) from the SUN dataset.
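The feature extraction step can be sketched in NumPy as follows. The project itself uses VLFeat's vl_hog; this simplified per-cell orientation histogram is only illustrative:

```python
import numpy as np

def hog_cells(img, cell_size=6, n_bins=9):
    """Minimal HoG sketch: per-cell histograms of gradient orientations.

    `img` is a 2-D grayscale array whose sides are multiples of `cell_size`.
    (The project uses VLFeat's vl_hog; this toy version is illustrative.)
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    # Unsigned orientation in [0, pi), quantized into n_bins bins.
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

    h, w = img.shape
    cy, cx = h // cell_size, w // cell_size
    feats = np.zeros((cy, cx, n_bins))
    for i in range(cy):
        for j in range(cx):
            sl = (slice(i * cell_size, (i + 1) * cell_size),
                  slice(j * cell_size, (j + 1) * cell_size))
            feats[i, j] = np.bincount(bins[sl].ravel(),
                                      weights=mag[sl].ravel(),
                                      minlength=n_bins)
    return feats.ravel()  # one fixed-length descriptor per window

# A 36x36 window with cell size 6 gives a 6*6*9 = 324-dim descriptor.
feat = hog_cells(np.random.rand(36, 36), cell_size=6)
```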
We trained a linear SVM (vl_svmtrain) with the regularization parameter (lambda) set to 0.0001 to obtain a linear classifier.
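vl_svmtrain optimizes the L2-regularized hinge loss; a toy NumPy stand-in using full-batch subgradient descent (not the project's actual solver) looks like:

```python
import numpy as np

def train_linear_svm(X, y, lam=1e-4, epochs=200, lr=0.05):
    """Toy stand-in for vl_svmtrain: full-batch subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w + b))).
    X: (n, d) feature matrix; y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1                  # samples violating the margin
        if active.any():
            w -= lr * (lam * w - (y[active, None] * X[active]).mean(axis=0))
            b -= lr * (-y[active].mean())
        else:
            w -= lr * lam * w
    return w, b

# Tiny demo on synthetic separable data (stand-ins for face / non-face features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.r_[np.ones(50), -np.ones(50)]
w, b = train_linear_svm(X, y)
```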
We used sliding windows at multiple scales (from 0.05 to 1.2 in steps of 0.05) to detect faces in the test images.
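The multi-scale window enumeration can be sketched as follows; the 50%-overlap stride is our illustrative choice, not necessarily what the project uses:

```python
import numpy as np

def sliding_windows(img_h, img_w, template=36,
                    scales=np.arange(0.05, 1.21, 0.05)):
    """Enumerate candidate boxes for multi-scale detection.
    At scale s the image is conceptually resized by s, so a fixed 36x36
    template covers a (36/s) x (36/s) region of the original image."""
    boxes = []
    for s in scales:
        win = int(round(template / s))    # window size in original pixels
        if win > min(img_h, img_w):
            continue                      # window larger than the image
        step = max(1, win // 2)           # 50% overlap (illustrative stride)
        for y in range(0, img_h - win + 1, step):
            for x in range(0, img_w - win + 1, step):
                boxes.append((x, y, x + win, y + win))
    return boxes

boxes = sliding_windows(200, 300)
```

Each surviving box would then be scored with the linear classifier and kept if its confidence exceeds the threshold.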
To evaluate the effect of each factor (e.g. the HoG cell size) on the test results, we vary one variable at a time while holding the others fixed.
The following results are obtained with HoG_Template_Size=36, Confidence_Threshold=-0.5.
| HoG Cell Size | Cell Size = 6 | Cell Size = 4 | Cell Size = 3 |
| --- | --- | --- | --- |
| HoG | ![]() | ![]() | ![]() |
| Average Precision | ![]() | ![]() | ![]() |
| Recall (Viola Jones) | ![]() | ![]() | ![]() |
| Sample Result | ![]() | ![]() | ![]() |
It turns out that the detection results improve as the HoG cell size gets smaller. However, the total running time also increases dramatically, so there is a clear tradeoff between average precision and running time.
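The runtime growth is easy to quantify from the descriptor size. Assuming VLFeat's vl_hog with its 31-dimensional per-cell descriptor (the UoCTTI variant), halving the cell size roughly quadruples the template's feature dimension:

```python
template = 36
dims_per_cell = 31   # vl_hog's UoCTTI descriptor: 31 values per cell
for cell in (6, 4, 3):
    n_cells = (template // cell) ** 2
    print(f"cell={cell}: {n_cells} cells, {n_cells * dims_per_cell}-dim template")
```

Cell size 3 yields a descriptor four times larger than cell size 6, and the detector must also score proportionally more, denser windows.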
To refine our classifier, we implement hard negative mining, which includes the following steps:

1. Train an initial SVM on the positive faces and random negatives.
2. Run the detector on the negative (face-free) scene images.
3. Collect the resulting false positives as "hard" negative examples.
4. Add them to the negative set and retrain the SVM.
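The mining loop can be sketched as follows. Here `svm_train` stands in for the vl_svmtrain call and `scene_feats` holds HoG descriptors of windows cut from the face-free scene images; both names are ours, not the project's:

```python
import numpy as np

def hard_negative_mining(svm_train, pos_feats, neg_feats, scene_feats, rounds=2):
    """Sketch of hard negative mining: retrain on false positives
    harvested from face-free scenes."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.r_[np.ones(len(pos_feats)), -np.ones(len(neg_feats))]
    w, b = svm_train(X, y)
    for _ in range(rounds):
        # Any window from a face-free scene that scores positive is,
        # by definition, a false positive -- a "hard" negative.
        scores = scene_feats @ w + b
        hard = scene_feats[scores > 0]
        if len(hard) == 0:
            break
        X = np.vstack([X, hard])
        y = np.r_[y, -np.ones(len(hard))]
        w, b = svm_train(X, y)            # retrain with the mined negatives
    return w, b

# Tiny demo with a least-squares stand-in for the SVM trainer.
rng = np.random.default_rng(1)
lstsq_svm = lambda X, y: (np.linalg.lstsq(X, y, rcond=None)[0], 0.0)
pos = rng.normal(2, 1, (20, 4))
neg = rng.normal(-2, 1, (20, 4))
scene = rng.normal(-2, 1, (30, 4))
w, b = hard_negative_mining(lstsq_svm, pos, neg, scene)
```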
The following results show the improvements hard negative mining has on the SVM.
(HoG_Template_Size=36, HoG_Cell_Size=3, Confidence_Threshold=-0.5)
| | Hard Negative Mining=OFF | Hard Negative Mining=ON |
| --- | --- | --- |
| Average Precision | ![]() | ![]() |
| Recall (Viola Jones) | ![]() | ![]() |
| Sample Result | ![]() | ![]() |
Hard negative mining does improve the performance slightly. However, it also increases the training time.
We searched for extra face datasets and found the Labeled Faces in the Wild (LFW) dataset from UMass. We selected around 8000 extra face images, resized them to 36×36, and mixed them with the Caltech faces. Finally, we divided the combined data into two new datasets, each containing around 10000 face images. The results below illustrate the performance of each dataset.
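One way to realize such a split (an assumption on our part: each set keeps the Caltech faces and takes half of the shuffled LFW images, giving roughly 6713 + 4000 ≈ 10000 faces each) can be sketched as:

```python
import numpy as np

def make_face_sets(caltech, lfw, seed=0):
    """Illustrative split: both sets keep every Caltech face and each
    takes half of the shuffled LFW faces. `caltech` and `lfw` are arrays
    of 36x36 crops; the exact split used in the project may differ."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(lfw))
    half = len(lfw) // 2
    set1 = np.concatenate([caltech, lfw[idx[:half]]])
    set2 = np.concatenate([caltech, lfw[idx[half:]]])
    return set1, set2

# Demo with dummy arrays standing in for the real face crops.
set1, set2 = make_face_sets(np.zeros((6, 36, 36)), np.ones((8, 36, 36)))
```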
(HoG_Template_Size=36, HoG_Cell_Size=3, Confidence_Threshold=-0.5, HNM=OFF)
| | NewFaceSet | NewFaceSet2 |
| --- | --- | --- |
| HoG | ![]() | ![]() |
| Average Precision | ![]() | ![]() |
| Recall (Viola Jones) | ![]() | ![]() |
| Sample Result | ![]() | ![]() |
In general, NewFaceSet2 performs better than NewFaceSet. After a rough inspection of the two datasets, we find that NewFaceSet contains many images of the same face seen from different directions, which is probably why its HoG image is not very face-like.
In search of better recognition, we look into the faces that cannot be detected and apply several image augmentation techniques, including contrast stretching, horizontal face flipping, and downsizing the negative samples.
Unfortunately, none of the techniques mentioned above yields a noticeable improvement in the recognition results.
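For reference, contrast stretching and horizontal face flipping can be sketched as follows (a minimal NumPy version, not the exact code we used):

```python
import numpy as np

def contrast_stretch(img):
    """Linearly rescale pixel intensities to the full [0, 1] range."""
    lo, hi = img.min(), img.max()
    if hi == lo:                      # flat image: nothing to stretch
        return np.zeros_like(img, dtype=float)
    return (img.astype(float) - lo) / (hi - lo)

def flip_faces(faces):
    """Horizontally mirror each face crop (shape (n, h, w)) so that
    every positive example yields a second, mirrored positive."""
    return faces[:, :, ::-1]
```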
(HoG_Template_Size=36, HoG_Cell_Size=3, Confidence_Threshold=-0.5, HNM=OFF)
| | Contrast Stretching | Flipped Face | Downsize Negative Samples |
| --- | --- | --- | --- |
| Average Precision | ![]() | ![]() | ![]() |
The best average precision we obtain is 0.937, under the following conditions:
HoG_Template_Size=36, HoG_Cell_Size=3, Confidence_Threshold=-1.1, HNM=ON
| | HoG | Average Precision | Recall (Viola Jones) | Sample Results |
| --- | --- | --- | --- | --- |
| Best Performance | ![]() | ![]() | ![]() | ![]() |
However, due to the relatively low confidence threshold, there are many false positives.
(HoG_Template_Size=36, HoG_Cell_Size=3, Confidence_Threshold=0.95, HNM=ON)
The END