P0.1 Indian Face detection and recognition

Updated: Nov 10, 2021

Most existing face recognition work does not perform well on Indian faces. This project uses a quick approach to Indian face detection and recognition.


Roboflow is a great tool for manually drawing bounding boxes on images. A screenshot from Roboflow is shown below; it shows labelling for face detection. I collected around 3000 Indian face images from Google Images and labelled them using Roboflow. Labelling took around three days because the dataset is large. These images are used only for the training and validation sets. All images are resized to 416 x 416 and then generated as the final dataset, which is exported in TensorFlow TFRecord format. A TFRecord file stores a sequence of binary records, and the TensorFlow Object Detection API needs the dataset in this format. Roboflow is under active development and also supports augmentation. Please go through this blog to learn more about Roboflow.
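Resizing an image to 416 x 416 also changes where each labelled box sits. Roboflow handles this automatically; the sketch below only illustrates the arithmetic, assuming a plain "stretch" resize (the resize mode actually used is not stated here). The function name resize_box and the example sizes are hypothetical.

```python
def resize_box(box, src_size, dst_size=(416, 416)):
    """Scale a box (x_min, y_min, x_max, y_max) from src_size to dst_size.

    Sizes are (width, height). Assumes the image is stretched to dst_size
    without preserving the aspect ratio.
    """
    scale_x = dst_size[0] / src_size[0]
    scale_y = dst_size[1] / src_size[1]
    x_min, y_min, x_max, y_max = box
    return (x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y)


# Hypothetical example: a face box labelled on a 1664 x 832 image.
resized = resize_box((400, 200, 800, 600), src_size=(1664, 832))
print(resized)
```

If the resize keeps the aspect ratio and pads instead, the padding offset has to be added to the scaled coordinates as well.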


TensorFlow object detection

The project needed a quick implementation of face detection. Two options were considered: YOLOv5 in PyTorch, or the TensorFlow Object Detection API (TF OD API). For this project, the TF OD API was preferred for training and for inference on edge devices. It is an open-source project built on top of TensorFlow and works with TensorFlow 2 (TF2). It supports SSD with MobileNet, RetinaNet, Faster R-CNN, and Mask R-CNN, as well as EfficientDet, a recent family of SOTA models. COCO pre-trained weights are provided for all of the models as TF2-style object-based checkpoints. It also supports distributed training. I followed the Colab notebook shared by the TF OD API and used fewer than 5 Indian face images to train the model with it.

For training using TF OD API:

!python model_main_tf2.py \
    --pipeline_config_path=/FacRec/pipeline.conf \
    --model_dir=/FacRec/model \
    --checkpoint_every_n=1000 \
    --num_workers=4


Comparison of SSD and YOLO

SSD stands for Single Shot MultiBox Detector, whereas YOLO stands for You Only Look Once. An image is fed to YOLO, which learns bounding box coordinates along with their probabilities. YOLO is very quick at prediction compared to SSD. YOLO divides the image into a grid of equal cells and assigns each bounding box a vector [Pc, bx, by, bh, bw, C1, C2], where Pc is the confidence that an object is present, bx, by, bh, bw describe the box position and size, and C1, C2 are the class scores. Each grid cell has its own confidence score. Most of the bounding boxes are eliminated by thresholding this score. Intersection over Union (IoU) and Non-Maximum Suppression (NMS) are used to eliminate the rest: NMS keeps the highest-scoring box for an object and suppresses the lower-scoring boxes that have a high IoU with it.
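The thresholding and suppression steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used in this project: the function names, the (x1, y1, x2, y2) box layout, and both threshold values are assumptions.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_threshold=0.5, iou_threshold=0.5):
    """Return indices of the boxes kept after thresholding and greedy NMS."""
    # Step 1: drop boxes whose confidence is below the score threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit the remaining boxes from highest to lowest confidence.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Toy example: two overlapping boxes, one separate box, one low-score box.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300), (0, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
kept = nms(boxes, scores)
print(kept)
```

In the toy example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the last box is removed by the score threshold.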

YOLO: 608 x 608 x 3 ==> 19 x 19 x 5 x 85

19 x 19 => Grid

5 => Anchor boxes per grid cell

80 => Classes

5 => [Pc, bx, by, bh, bw]
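As a quick check of the dimensions above, the output size can be worked out directly. The variable names below are only for illustration.

```python
grid = 19        # the image is divided into 19 x 19 grid cells
anchors = 5      # anchor boxes per grid cell
classes = 80     # class scores per box
box_terms = 5    # [Pc, bx, by, bh, bw]

values_per_box = box_terms + classes          # 5 + 80
total_boxes = grid * grid * anchors           # boxes before thresholding and NMS
total_values = total_boxes * values_per_box   # size of the full output tensor

print(values_per_box, total_boxes, total_values)
```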

YOLO generates fewer bounding boxes than SSD. The original YOLO generates 7 x 7 x 2 = 98 bounding boxes, whereas SSD generates 5776 + ... + 4 = 8732 bounding boxes in total. The formula for the SSD output of one feature layer is

m x n x p ==> m x n x k x (c + 4)

m x n => size of the feature map from the previous layer

p => number of channels in the previous layer

k => bounding boxes per location

c => class scores (including 1 background class)

4 => box offset values

In SSD, the convolutional network runs over the image only once to generate the feature maps. SSD is a much-preferred object detection algorithm because of its inference time and accuracy. SSD generates more boxes by using anchors of different shapes, since objects come in different shapes. For example, detecting a man and a car requires anchors of different shapes. Here is how a 300 x 300 x 3 image flows through SSD and how many bounding boxes (BB) each stage generates. The image is passed through the VGG-16 network to produce a 38 x 38 feature map with 4 BB per location. This is reduced to a 19 x 19 feature map with 6 BB, then 10 x 10 with 6 BB, 5 x 5 with 6 BB, 3 x 3 with 4 BB, and finally 1 x 1 with 4 BB. Together these give 5776 + 2166 + 600 + 150 + 36 + 4 = 8732 BB per class. Finally, NMS is applied to produce the final BB for each object in the image.
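The total of 8732 boxes can be reproduced from the formula m x n x k. The sketch below uses the SSD300 feature map sizes and boxes per location from the SSD paper (4, 6, 6, 6, 4, 4); the function names and the class count of 21 are assumptions made for illustration.

```python
# (feature map side, default boxes per location) for SSD300, as configured
# in the SSD paper; treat these values as an assumption about the setup here.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def boxes_per_layer(layers):
    """Number of boxes each feature map contributes: m * n * k."""
    return [side * side * k for side, k in layers]


def outputs_per_layer(layers, num_classes):
    """Prediction values per layer: m * n * k * (c + 4)."""
    return [side * side * k * (num_classes + 4) for side, k in layers]


counts = boxes_per_layer(SSD300_LAYERS)
print(counts)
print(sum(counts))

# c = 21 is a hypothetical example (20 object classes + 1 background class).
outputs = outputs_per_layer(SSD300_LAYERS, num_classes=21)
print(outputs[0])
```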

Face Recognition

Face recognition uses transfer learning to map to Indian face features. A state-of-the-art face recognition approach is used to generate an embedding vector of size 195 for each person. A ResNet50 model was first trained on a face dataset containing a wide variety of images. The trained model was then trained on the 3000 Indian face images using transfer learning with the same approach.
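Once every person has an embedding, recognition usually comes down to comparing a new embedding against the stored ones. How the comparison is done in this project is not stated, so the sketch below is only an assumption: it uses cosine similarity with a hypothetical threshold of 0.6, and the function names, gallery names, and toy 195-dimensional vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify(query, gallery, threshold=0.6):
    """Return the best-matching name, or None if no score reaches the threshold."""
    best_name, best_score = None, -1.0
    for name, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None


# Toy 195-dimensional embeddings; real ones would come from the trained model.
gallery = {
    "person_a": [1.0] * 195,
    "person_b": [1.0 if i % 2 == 0 else -1.0 for i in range(195)],
}
known_query = [0.9] * 195
unknown_query = [1.0] * 97 + [-1.0] * 98

print(identify(known_query, gallery))
print(identify(unknown_query, gallery))
```

The threshold trades false accepts against false rejects and would need tuning on a validation set.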


  1. https://arxiv.org/pdf/1512.02325.pdf
