Training of SSD (Single Shot Detector) for Facial Detection Using Nvidia Jetson Nano

Saif Ur Rehman; Muhammad Rashid; Muhammad Hadi

doi:10.36648/2349-3917.21.9.95

Saif Ur Rehman^*, Muhammad Rashid and Muhammad Hadi

Department of Animal Breeding and Genetics, University of Agriculture, Faisalabad, Pakistan

*Corresponding Author: Saif Ur Rehman
Department of Animal Breeding and Genetics
University of Agriculture
Faisalabad
Pakistan
E-mail: saifurrehman4114@gmail.com

Received Date: May 12, 2021; Accepted Date: May 26, 2021; Published Date: June 05, 2021

Citation: Rehman SU, Rashid M, Hadi M (2021) Training of SSD (Single Shot Detector) for Facial Detection Using Nvidia Jetson Nano. Am J Compt Sci Inform Technol Vol.9 No.6: 95.

Visit for more related articles at American Journal of Computer Science and Information Technology

Abstract

In this project we used the SSD (Single Shot Detector) computer vision algorithm and trained it on a dataset of 139 pictures. The images were labeled using Intel CVAT (Computer Vision Annotation Tool). We trained the model for facial detection and deployed the trained model and software on the NVIDIA Jetson Nano Developer Kit. The model code is written in Python using the PyTorch deep learning framework.

Keywords

Computer vision; PyTorch; Software engineering; Deep learning

Introduction

We use the NVIDIA Jetson Nano Developer Kit as our accelerator system. It runs a Docker container that holds the dataset and the trained SSD (Single Shot Detector) MobileNetV2 model, which we use for facial detection.

Video is recorded through the camera attached to the accelerator system. The code of the SSD MobileNetV2 model is written in the Python programming language, and the deep learning framework used is PyTorch. To optimize the neural network layers, NVIDIA TensorRT is used for faster inference at run time. NVIDIA TensorRT is built on NVIDIA CUDA for parallel computing [1].

Related work

Our project is related to deep learning, a subdomain of machine learning that has revolutionized the field of AI. In deep learning, data is fed to a model so that it learns the features of that data; the trained model can then be used for inference on unseen data. The field rose to prominence when a convolutional neural network was trained to classify the images in the ImageNet LSVRC-2010 contest.

These days deep learning is used everywhere, from data science and big data to image classification, image segmentation, natural language processing, robotics, and computer vision.

A deep learning model's architecture has an input layer that takes in the input data, followed by hidden layers of perceptrons whose nodes are interconnected to extract features from the input data, and finally an output layer, also called the inference layer, that produces the result [2].

In our project we use the SSD deep learning model, which is faster and more accurate with a smaller dataset than RNN and FRCNN. SSD was evaluated individually on the PASCAL VOC, COCO, and ILSVRC datasets. When trained on PASCAL VOC, SSD reaches 76.9% mAP, whereas FRCNN trained on PASCAL reaches 66% mAP, so FRCNN is both less accurate and slower.

Methodology

Marvin Minsky made the first attempt to mimic the human brain more than 50 years ago, prompting further research into the ability of computers to interpret knowledge and make wise decisions. Over the years, the effort to automate image processing has led to the development of many algorithms. Basic neural networks and algorithms emerged in the 1980s and 1990s, but deep learning methods only began to accelerate from 2010 onwards. In 2012, Google Brain built a neural network running on 16,000 computer processors that could recognize images of cats. With the internet as a backbone, computer scientists gained access to more knowledge than ever before, while computer hardware continued to fall in cost and improve in performance. The field of artificial intelligence, now more than half a century old, finally had its breakthrough moment.

The ILSVRC is an annual image classification competition in which research teams test their algorithms on a given dataset and compete on multiple visual recognition tasks to achieve the highest accuracy. In 2012, a University of Toronto team entered a deep neural network called AlexNet, marking a breakthrough for artificial intelligence and computer vision [3].

TensorFlow and PyTorch

TensorFlow and PyTorch are two well-known deep learning libraries, and both are open source. TensorFlow builds on ideas from Theano and originated at Google, while PyTorch is based on Torch and originated at Facebook. A critical distinction between them is the way they define computational graphs: TensorFlow traditionally builds a static graph, whereas PyTorch builds a dynamic graph. PyTorch is more Pythonic, and AI models are easier to build in it, while TensorFlow may require some training to use. PyTorch was released in 2016 by Facebook's AI Research lab. It is primarily intended for use from Python, but it also has a C++ interface. TensorFlow supports many programming languages. TensorFlow uses TensorBoard for visualizing neural networks. PyTorch, on the other hand, does not have a built-in visualization feature and relies on Python packages for plotting. PyTorch is the standard deep learning framework for researchers, while TensorFlow is preferred on the commercial side. TensorFlow's extensions for deployment on both servers and smartphones make it the preferred choice for teams that work with deep learning.
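The dynamic graph behaviour described above can be illustrated with a minimal sketch. The function `forward` and its input values are hypothetical and chosen only for illustration; the point is that ordinary Python control flow decides, at run time, which operations PyTorch records for differentiation.

```python
import torch


def forward(x: torch.Tensor) -> torch.Tensor:
    """Toy computation whose graph depends on the input values (illustrative only)."""
    y = x
    # A plain Python loop: the number of recorded multiplications
    # is decided while the code runs, not in a separate graph-building step.
    while y.norm() < 10:
        y = y * 2
    return y.sum()


x = torch.tensor([1.0, 2.0], requires_grad=True)
out = forward(x)
out.backward()  # autograd walks the graph that was just recorded
print(out.item(), x.grad)
```

For this input the loop doubles the tensor three times, so the gradient of the sum with respect to each element is 8. A different input could run the loop a different number of times and produce a different graph.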

Convolutional neural network

An artificial neural network is inspired by the network of neurons in the human brain. At each layer of an artificial neural network, the outputs of the previous layer are multiplied by weights and summed. In this way features are extracted from the input, and the prediction is made in the final layer (Figure 1).

Figure 1: X-band MMIC power amplifiers.
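As a rough illustration of the weighted sum performed at each layer, the following sketch computes two small layers in plain Python. The weights, the bias terms, and the ReLU activation are assumptions made for the example and are not taken from the model used in this work.

```python
def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of the inputs plus a bias, then ReLU.

    The bias and the ReLU activation are illustrative assumptions.
    """
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(max(0.0, total))  # ReLU keeps only positive responses
    return outputs


# Hypothetical two-input network: one hidden layer with two neurons, one output neuron.
hidden = dense_layer([1.0, 2.0], weights=[[0.5, -1.0], [1.0, 1.0]], biases=[0.0, 0.5])
prediction = dense_layer(hidden, weights=[[2.0, 1.0]], biases=[0.0])
print(hidden, prediction)
```

Each hidden neuron sums its weighted inputs, and the final layer turns those hidden features into the prediction.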

The convolution operation specific to CNNs combines the input data from one layer with a convolution filter to produce a feature map for the next layer. CNNs for image classification are generally composed of an input layer (the image), a series of hidden layers for feature extraction (the convolutions), and a fully connected output layer (the classification). Deep learning relies on Convolutional Neural Network (CNN) models to transform images into predicted classifications. A CNN is a class of artificial neural network made of convolutional layers that extract features from the input data, and it is the preferred network for image applications. As it is trained, the CNN adjusts automatically to find the most relevant features for its classification task (Figure 2).

Figure 2: Convolutional neural network architecture.
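The layer structure described above (image input, convolutional feature extraction, fully connected classification) might be sketched in PyTorch as follows. `TinyCNN`, its layer sizes, the 32×32 input, and the four output classes are all hypothetical choices for illustration; this is not the network trained in this work.

```python
import torch
from torch import nn


class TinyCNN(nn.Module):
    """Illustrative CNN: two convolution stages followed by one linear classifier."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        # Hidden layers: each convolution produces feature maps for the next layer.
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 32x32 -> 16x16
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        # Output layer: fully connected classification over the flattened features.
        self.classifier = nn.Linear(16 * 8 * 8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))


model = TinyCNN()
images = torch.randn(2, 3, 32, 32)  # a batch of two random RGB "images"
feature_maps = model.features(images)
logits = model(images)
print(feature_maps.shape, logits.shape)
```

The printed shapes show 16 feature maps of 8×8 per image after the convolutions, and one score per class after the classifier.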

Single shot detector: With SSD, a single shot is enough to detect the objects in an image, whereas models based on a region proposal network (RPN) require two shots. Our PyTorch implementation of SSD for object detection uses a MobileNet backbone. SSD was released in 2016 and improved object detection with high accuracy, reaching over 74% mAP at 59 frames per second on datasets such as PASCAL VOC and COCO. SSD performs object detection and classification in a single forward pass. The detector is used to detect and distinguish the objects. The confidence loss measures how sure the network is about the object detected in a bounding box. The location loss measures how far the network's predicted bounding box is from the ground-truth box in the training set. Feature maps represent the major features of the image. The MultiBox technique helps detect the objects clearly (Figure 3).

Figure 3: SSD architecture.
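The confidence loss and location loss mentioned above can be sketched in plain Python. This is a simplified illustration that assumes softmax cross-entropy for the confidence term and smooth L1 for the location term, sums only over already matched boxes, and omits steps such as default-box matching and hard negative mining. The function names and the weighting parameter `alpha` are hypothetical.

```python
import math


def confidence_loss(class_scores, true_class):
    """Softmax cross-entropy for one box: low when the true class has the top score."""
    max_score = max(class_scores)
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in class_scores))
    return log_sum_exp - class_scores[true_class]


def smooth_l1(diff):
    """Quadratic for small errors, linear for large ones."""
    diff = abs(diff)
    return 0.5 * diff * diff if diff < 1.0 else diff - 0.5


def location_loss(predicted_box, target_box):
    """Sum of smooth L1 errors over the four box coordinates (or encoded offsets)."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted_box, target_box))


def ssd_loss(matches, alpha=1.0):
    """Average combined loss over matched boxes.

    matches: list of (class_scores, true_class, predicted_box, target_box).
    """
    n = max(len(matches), 1)
    conf = sum(confidence_loss(s, c) for s, c, _, _ in matches)
    loc = sum(location_loss(p, t) for _, _, p, t in matches)
    return (conf + alpha * loc) / n


example = [([0.0, 0.0], 0, [0.5, 0.0, 2.0, 0.0], [0.0, 0.0, 0.0, 0.0])]
print(ssd_loss(example))
```

In the example, equal scores for two classes give a confidence loss of ln 2, and the two coordinate errors of 0.5 and 2.0 contribute 0.125 and 1.5 to the location loss.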

The SSD authors argue that data augmentation is very important for improving the accuracy of the model.
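One common augmentation for detection is a horizontal flip, where the bounding boxes must be mirrored together with the image. The sketch below is a minimal plain-Python illustration; the image representation (a list of pixel rows) and the `(xmin, ymin, xmax, ymax)` pixel box format are assumptions for the example, not the augmentation pipeline used in training.

```python
def horizontal_flip(image, boxes):
    """Mirror an image and its bounding boxes left to right.

    image: list of rows, each row a list of pixel values.
    boxes: list of (xmin, ymin, xmax, ymax) in pixel coordinates.
    """
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # The left edge becomes the right edge, so xmin and xmax swap roles.
    flipped_boxes = [
        (width - xmax, ymin, width - xmin, ymax)
        for xmin, ymin, xmax, ymax in boxes
    ]
    return flipped_image, flipped_boxes


image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
boxes = [(0, 0, 1, 2)]  # a box covering the leftmost column
flipped_image, flipped_boxes = horizontal_flip(image, boxes)
print(flipped_image, flipped_boxes)
```

After the flip, the box that covered the leftmost column covers the rightmost one, so the label still points at the same pixels.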

Jetson Nano developer kit

The NVIDIA Jetson Nano developer board is very useful because its on-board GPU lets us run neural networks quickly for applications in computer vision and robotics (Figure 4).

Figure 4: Jetson Nano developer kit.

• GPU: 128-core Maxwell.

• CPU: Quad-core ARM A57, 1.43 GHz.

• Memory: 4 GB 64-bit LPDDR4, 25.6 GB/s.

• USB: 4x USB 3.0, USB 2.0 micro-B.

• Display: HDMI and DisplayPort.

• Serial communication protocols: GPIO, I2C, I2S, SPI, UART.

• Board power rating: 5 V 2 A, 5 V 4 A.

• Storage: 64 GB Solid State Drive (SSD).

• Operating system: Ubuntu 18.04.

A4tech webcam PK-810G

The webcam has an anti-glare coating to avoid disturbing reflections and captures good-quality images even in low-light conditions. Intelligent multisampling provides smooth video transmission without aliasing [4]. It is plug and play, with no software installation required. Other features include a snapshot button, a free webcam driver, USB 2.0, and a built-in microphone (Figure 5).

Figure 5: A4tech webcam PK-810G.

• Resolution: 480p, 640 × 480 pixels

• Focus range: 60 cm and beyond

• Built-in mic.: 1 mic.

• Output video format: MJPEG

• Frame rate: 30 fps

CVAT

CVAT is an open-source tool for annotating digital images and videos. The main function of the application is to provide users with suitable annotation tools [5]. For that purpose, CVAT was designed as a convenient service with many useful features. CVAT is a browser-based application for both individuals and teams that supports different work scenarios. The main tasks of supervised machine learning can be divided into three groups:

• Object detection

• Image classification

• Image segmentation

CVAT allows you to annotate images and videos for each of these cases [6]. Here are some advantages and disadvantages of the tool.

Advantages: Web-based. Users do not need to install the CVAT app; they only have to open the tool's link in a browser to create a task or annotate data [7]. They can create a public project and split the project work among other users. Easy to deploy. CVAT can be deployed on the local network using Docker. It can also be used with a deep learning deployment toolkit.

Disadvantages: Limited browser support. CVAT works only in Google Chrome. It is not supported in other browsers, but it may work in Chromium-based browsers such as Opera [8]. All tests have to be done manually, which considerably slows the development process (Figures 6 and 7).

Figure 6: A4tech webcam PK-810G.

Figure 7: Facial detection.
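Bounding boxes labeled in CVAT can be read back for training with a short script. The sketch below assumes the annotations were exported in CVAT's XML format for images, in which each box carries `label`, `xtl`, `ytl`, `xbr`, and `ybr` attributes; the export format actually used in this work is not stated, and the file names and coordinates in `SAMPLE_XML` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical export: file names and coordinates are invented for this example.
SAMPLE_XML = """<annotations>
  <image id="0" name="frame_000.jpg" width="640" height="480">
    <box label="Saif" xtl="100.0" ytl="80.0" xbr="220.0" ybr="240.0" occluded="0"/>
  </image>
  <image id="1" name="frame_001.jpg" width="640" height="480">
    <box label="Hadi" xtl="300.5" ytl="90.0" xbr="410.0" ybr="250.0" occluded="0"/>
    <box label="Rashid" xtl="50.0" ytl="60.0" xbr="150.0" ybr="200.0" occluded="0"/>
  </image>
</annotations>"""


def load_boxes(xml_text):
    """Return {image name: [(label, xmin, ymin, xmax, ymax), ...]}."""
    root = ET.fromstring(xml_text)
    annotations = {}
    for image in root.iter("image"):
        annotations[image.get("name")] = [
            (
                box.get("label"),
                float(box.get("xtl")),  # top-left x
                float(box.get("ytl")),  # top-left y
                float(box.get("xbr")),  # bottom-right x
                float(box.get("ybr")),  # bottom-right y
            )
            for box in image.iter("box")
        ]
    return annotations


annotations = load_boxes(SAMPLE_XML)
print(annotations)
```

Keeping the boxes keyed by image name makes it straightforward to pair each picture with its labeled faces when building the training set.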

Results and Discussion

We trained this computer vision algorithm and deployed it on the Jetson Nano. Training took around 2 hours with the board running at full performance. As hyperparameters, we set the number of epochs to 5, with 2 iterations in each epoch. The dataset contained 139 images in total, of which around 110 were used for training and around 29 for validation. The learning rate was 0.01 for both the base network layers and the extra layers. There were 3 classes for facial detection: Saif, Hadi, and Rashid. A background class is always added by default. Accuracy when tested was approximately 97%.
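The setup above can be summarized in a short sketch. The split helper, the image identifiers, the random seed, and the dictionary keys are hypothetical; the exact split sizes of 110 and 29 are taken as the approximate figures reported above, and how the images were actually assigned to each set is not stated.

```python
import random


def split_dataset(items, num_validation, seed=0):
    """Shuffle with a fixed seed and hold out num_validation items for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[num_validation:], shuffled[:num_validation]


# Hypothetical identifiers standing in for the 139 labeled pictures.
image_ids = [f"img_{i:03d}" for i in range(139)]
train_ids, val_ids = split_dataset(image_ids, num_validation=29)

# Hyperparameters as reported in the text; the key names are illustrative.
hyperparameters = {
    "epochs": 5,
    "iterations_per_epoch": 2,
    "base_net_lr": 0.01,
    "extra_layers_lr": 0.01,
    "classes": ["BACKGROUND", "Saif", "Hadi", "Rashid"],
}

print(len(train_ids), len(val_ids), len(hyperparameters["classes"]))
```

Fixing the seed keeps the training and validation sets the same across runs, which makes repeated experiments on such a small dataset comparable.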

Conclusion

For facial detection, the SSD (Single Shot Detector) algorithm works very well. Its accuracy on a small dataset and its short training time make it a very effective algorithm.

In future work, we plan to use it in an embedded system and apply it to solve a real-world problem.