
How I built a face recognition SDK using Computer Vision


This is the summary of the talk by Sandeep Giri presented at Git Commit Show 2019.

About the speaker

We have Sandeep Giri, founder of Deja View. He graduated from IIT Roorkee. In this session, he explains how he built a face recognition SDK using computer vision.

Summary of the talk

First of all, let me introduce what exactly Deja View is. It is an SDK for computer vision: we take away the pain of doing machine learning and deep learning for engineers and app developers, so that they can quickly focus on their use cases. Let me go with the flow of the talk. The talk is going to be structured around:

  • I'll introduce myself
  • I'll talk about Deja View
  • I'll show you how it works
  • I'll cover the architecture details


What exactly is the key ingredient for machine learning or AI? I'm going to talk about the DevOps and architecture part of it. It is heavy engineering work, and with this knowledge you will be able to build a lot of great stuff. I'm Sandeep Giri, the founder of Deja View and CloudxLab. I graduated from IIT Roorkee long back and have worked on large-scale computing as well as machine learning at Amazon and InMobi. Before CloudxLab, I started my own company called T-Bits Global. I love software engineering and I love explaining technologies, and that's why I started CloudxLab and why I'm speaking at Git Commit Show.


The purpose of Deja View is to make computer vision, the ability of a computer to see and perceive (one of the key sub-objectives of AI), available as an SDK or an API. That way, developers don't have to do all the heavy lifting of deep learning, don't have to do all the DevOps, deployment, and scaling work that requires knowledge of ML, and don't need to create or maintain that kind of infrastructure. The main idea is to make the vision part of AI pluggable, so that you can plug and play it. To build an AI app, you need a powerful machine learning model, huge infrastructure, and expertise both in scaling and in machine learning. The existing solutions require developers to know what ML is, how to scale it, how to deploy it, and so on, and to launch a lot of infrastructure and services; I'll go into those details later. At the same time, the existing solutions don't offer great precision. So I, along with Praveen and Abhinav, started Deja View, and we are focusing on offering it for free in the beginning.

Let me give you a demo so that you can get an idea of how it works. Imagine that you are running a shop. If you go to the website, it will ask you to sign up. This is my shop; when a new customer or a new user comes in, I just click capture, and it recalls whether the person exists in the database. I exist in the database in these photos, so it has recalled me. You can set up your own shop, store a person's information the first time they walk in, and later it will recall who it was. If the person is not found in the database, it asks you to store them; after that, all you have to do is click capture and the person's photo is captured and recalled. In this way you can easily build a zero-touch front desk for your company, shop, building, school, and so on, without having to rely on memorizing everybody's face.

This is extremely useful for many use cases. For your office, you can build a front desk very quickly. If you are running a shopping store or a supermarket, you can build loyalty programs around faces. For the hotel industry, it could remember that a person came on such-and-such days and has made this many visits, so you can customize the individual's experience (whether this person likes the AC cool or warm, what temperature the bath should be, and so on) without bothering the individual. You could also do classroom attendance. And you can build your own solution using the SDK provided by Deja View and fit in new use cases as per your requirements.
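To make this flow concrete, here is a minimal sketch of what a front-desk integration could look like. Everything in it, including the dejaview module and its method names, is a hypothetical illustration, not the actual Deja View API.

```python
# Hypothetical sketch of the capture-and-recall flow; the dejaview
# module and every method name here are illustrative, not the real SDK.
import dejaview  # hypothetical client library

client = dejaview.Client(api_key="YOUR_API_KEY")

with open("visitor.jpg", "rb") as f:
    photo = f.read()

match = client.recall(photo)  # is this face already in the database?

if match is None:
    # First visit: store the face along with the visitor's details
    client.store(photo, name="New Visitor")
else:
    print(f"Welcome back, {match.name}!")
```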
The way it works is: first there is user authentication. Your app calls our API, the authentication happens, and a picture is uploaded. The server then internally runs the neural network, extracts the face embedding, gets the existing faces from the database, and figures out which one is the closest. The main problem is that loading the model and running inference takes a long time, and the matching step takes even longer; combined, it usually took more than three to four seconds, and in the worst case longer than five. By cleverly optimizing these two steps, we brought the overall time of finding the closest face for an image down below one second, to around 0.3 seconds overall. Even in the worst-case scenario it now does pretty well.

I'll go into more detail about the architecture, but before that I would like to explain what exactly an embedding is. An embedding is generated by a neural network that takes a photo and converts it into numbers. Specifically, the neural network under the hood here is FaceNet, and what FaceNet does is take a photo and convert it into around 128 numbers. Why so many? My height can be represented by one number, but my running speed or my general abilities are very hard to represent in a single number. For example, just by looking at my marks in math, you cannot judge me; to judge me, you need to know a lot more about me. Similarly, to characterize a face you need more numbers, somewhere around a hundred or more. It's hard to say whether 64 numbers are enough, or 50, or 128, but for this discussion let's keep it at 128, because that's what FaceNet used. Another question you might have is: where do these numbers come from? Do they represent the size of my eyes? The shape of my chin? What do these numbers represent?


The answer is that nobody knows, not even the people who built FaceNet. What they built, by way of machine learning, is an optimization system that generates these 128 numbers such that if two photos show the same face, the numbers come out close to each other, and if the photos show different people, the numbers come out far apart. You then take the Euclidean distance between the two sets of numbers: for two photos of the same person, this distance should be very small, and for photos of different people, it should be big. So the idea was to figure out numbers like these to represent faces.
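As a minimal sketch of that comparison, assuming the 128-number embeddings are already available (the 1.0 cut-off is an illustrative assumption; in practice it is tuned on data):

```python
import numpy as np

def face_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Euclidean distance between two 128-number face embeddings."""
    return float(np.linalg.norm(emb_a - emb_b))

# Same person: small distance. Different people: large distance.
THRESHOLD = 1.0  # illustrative cut-off, tuned on real data in practice

def same_person(emb_a: np.ndarray, emb_b: np.ndarray) -> bool:
    return face_distance(emb_a, emb_b) < THRESHOLD
```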


They gathered lots of faces: many photos of one person as well as photos of many other people, including multiple photos of the same person at different ages and with other variations. Using that data, a neural network was built that could convert a face into these numbers. This is called a face embedding.
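The training objective behind this is FaceNet's triplet loss: take an anchor photo, another photo of the same person (the positive), and a photo of a different person (the negative), and push the distances the right way. A minimal TensorFlow sketch:

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=0.2):
    """FaceNet-style triplet loss: pull same-person embeddings together,
    push different-person embeddings at least `margin` further apart."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
```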


This is another example of face embedding. In the previous talk, somebody was demoing the idea of embeddings with the "don't touch your face" app; it's the same idea I'm using here, and we also call it transfer learning, using pre-trained neural networks, and so on. The main idea I'm trying to show is this: let's say the photo on the right-hand side is Sandeep Giri, and on the left-hand side there are actors. I want to find out which actor is closest to me. All I have to do is convert each of these faces into numbers and compare mine against theirs. You could do a direct comparison, or you could use machine learning models such as KNN or SVM to find which one is closest to my face. Comparing my numbers against each of these faces, I found that there is an actor called Suraj Sharma whose embeddings are closest to mine, and he does look like me when I was younger. This is how embeddings work: converting a face into roughly 128 numbers using a pre-trained network. The most important thing to understand is that a face can be converted into numbers.

The way the web application works is that users make API calls to the web application, which in Deja View's case is Django (Python). In the middle there is an asynchronous processing step, because the caller can't keep waiting until the processing finishes: the server immediately replies "your work is in progress," and the client keeps polling to see when the work is done. On one side, this async process talks to a service called image-embed, which converts the image into an embedding; it runs inside Docker containers on a Kubernetes cluster, so as the load increases it keeps scaling out, making it practically infinitely scalable. On the other side, there is a database of users that already holds the users' embeddings. So one call goes to the Docker service asking for the embedding, and another call goes to the NoSQL data store to fetch the users' information. Notice that we have made these two calls parallel, and you can do the same; that essentially cuts the time in half. At some point the matching process itself becomes the big overhead, so it too has to be scaled in the same way. Docker and Kubernetes, along with a NoSQL store, can scale this to billions of users without too many problems. The matching itself is done using TensorFlow, and that's how it works. So that's all about machine learning and my project, Deja View.
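Before moving to the questions, here is a rough sketch of that parallel fan-out. The service URL, the response format, and the database helper are assumptions for illustration, not the actual Deja View internals.

```python
# Sketch of the parallel fan-out: one call to the embedding service,
# one to the user database, then a nearest-neighbour match.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import requests

EMBED_SERVICE = "http://image-embed.internal/embed"  # assumed endpoint

def get_embedding(image_bytes: bytes) -> np.ndarray:
    resp = requests.post(EMBED_SERVICE, files={"image": image_bytes})
    return np.array(resp.json()["embedding"])  # assumed response shape

def get_known_embeddings() -> dict:
    # Placeholder for the NoSQL lookup: {user_id: stored_embedding}
    return {}

def find_closest_face(image_bytes: bytes):
    # Run both calls in parallel, as described above
    with ThreadPoolExecutor() as pool:
        emb_future = pool.submit(get_embedding, image_bytes)
        db_future = pool.submit(get_known_embeddings)
        emb, known = emb_future.result(), db_future.result()
    if not known:
        return None  # no faces stored yet
    # Nearest neighbour by Euclidean distance between embeddings
    return min(known, key=lambda uid: np.linalg.norm(known[uid] - emb))
```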

Now it's time for questions and answers.

Q1. What was the most important decision that brought down the time from 5 seconds to 0.3 seconds?

Converting the whole neural network from being part of the web application into a separate service: that was the most important decision, and it brought the time down drastically. The step that had been taking more than a second came down sharply, and overall something that was taking around three to four seconds came down to 0.3 seconds. Understanding how the neural network works was the most crucial part of it. Once you have these modular components, wherever you feel a step is going to take too much time in Python, you can put it behind TensorFlow or other APIs and use it as a service. So engineering is crucial even in the case of machine learning models: you are usually not going to improve the model itself, it is what it is, but if you understand the engineering around it, you can solve a lot of the problems. If you look at the progression of most machine learning systems, most of the gains come from the engineering, not from the neural network science.
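A minimal sketch of that idea: load the model once in a long-running process and serve inference over HTTP, instead of paying the loading cost on every request. Flask, the model file name, and the input size are illustrative assumptions, not the actual Deja View stack.

```python
# Sketch: keep the neural network loaded in a dedicated service so
# each request pays only for inference, not for model loading.
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
model = tf.keras.models.load_model("facenet.h5")  # loaded once at startup

@app.route("/embed", methods=["POST"])
def embed():
    # Expects a preprocessed face as nested lists, e.g. 160x160x3 pixels
    face = np.array(request.json["face"], dtype="float32")
    embedding = model.predict(face[np.newaxis, ...])[0]
    return jsonify({"embedding": embedding.tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```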


Q2. Now we have the Deja View SDK, which people can use if they want to implement face recognition, and it's a high-performing SDK. What resources are needed to use it? What kind of machine do I need? Do I plug in an AWS machine, or a GPU? How does it work from the user's perspective?

If you have a web app or a mobile app, then just like the way you embed Google Analytics, you can embed Deja View into your mobile app or web app. You don't need any heavy lifting on your side. You can focus on your business logic and leave the face recognition and other AI to Deja View.

Q3. We have a theme for this conference, the pursuit of mastery, and you have a great track record of working at different organizations, building machine learning algorithms and projects, and teaching people about machine learning. What is your learning related to this topic?

If you want to do well in machine learning, remember that most of the time machine learning is about good engineering. My recommendation to probably every engineer out there would be to focus on data structures and algorithms. At the same time, focus on learning Linux well, along with basic math. If you are good with these, you will see a lot of doors open. Even though the bigger things like DevOps and deep learning might look complex, if you master the basic components (data structures, algorithms, Linux, and so on), you don't have to panic that you are left behind. Those fundamentals are what make the big difference, even in the case of Deja View. Fundamentals play the key role, because once they are clear you will be able to learn anything. For example, learning Docker and Kubernetes requires a good knowledge of Linux. If you are good with basic coding in Python and with handling data, you'll be able to pick up the application side of machine learning very quickly; and if you're good with math and with how the algorithms work, you will be able to quickly pick up the entire science behind machine learning. So my recommendation would be to become good with the basics first and then go after these bigger things.