This machine learning application was built by Adiba Naaz, Aasema Kauser, and Amena Anjum.
Business Use Case
Language has always been a barrier for people with hearing and speech impairments. Whether in a public place or a business setting, they find it challenging to convey a message, express their thoughts, or make themselves understood. Communicating within their own community is not the problem; communicating with everyone else is an issue they face daily. These people use sign language, expressing ideas and thoughts through hand gestures, as their primary means of communication. The problem is that most other people never learn sign language: learning it takes time and effort, and can be discouraging. An AI model that takes hand gestures as input and translates them into text or short sentences would help remove this language barrier.
Thinking of places where such a system could benefit this community, a few come to mind: restaurants, stores, supermarkets, and so on.
If business owners understood the problem and adopted technology that translates gestures into words or short sentences, the more accessible experience would draw more customers and could increase profits.
In this project, we attempted to build an AI model that takes hand gesture input and translates it into text.
We created our own dataset as a proof of concept by capturing images of three classes ("A", "B", and "C"): roughly 14 MB of image data, with 400 images per class, 1,200 images in total.
In this project, we followed the standard data science workflow:
- Data collection
- Data pre-processing and feature extraction
- Model training and evaluation
Note that the notebooks for each step above have been provided in the A360 GitHub repo.
1. Data Collection
We captured images using Google Teachable Machine and the OpenCV library.
2. Data Preprocessing and Feature Extraction
The idea here is to segment the hand out of the background. We started with simple thresholding, where a fixed threshold value is set and each pixel is compared against it: if the pixel value is below the threshold it is set to 0, otherwise it is set to the maximum value. The input for thresholding must be a single-channel image, so we first converted each image to grayscale.

For better performance in poor lighting conditions, instead of computing one global threshold we used adaptive thresholding, in which a threshold is calculated for each small region of the image, so different areas get different threshold values.

Manually selecting a threshold value can still go wrong. To avoid that, we used Gaussian-filter-based Otsu's thresholding, which calculates the threshold automatically. For this technique, the grayscale image is first blurred with a Gaussian filter.
[Figures: grayscale images of A, B, and C, and the corresponding cleaned (thresholded) images of A, B, and C]
As the images above are a bit noisy, we applied a morphological closing operation to remove the noise. Closing dilates the image and then erodes the dilated result.
3. Model training
After pre-processing the images, our training data was ready. We built a convolutional neural network (CNN) from scratch, split the data into train/test sets with an 80/20 ratio, and trained the model for 100 epochs.
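The split and a CNN along these lines can be sketched as below. The exact architecture is an assumption (the notebook in the A360 GitHub repo defines the real one), and the Keras import is deferred so the split helper works without TensorFlow installed.

```python
import numpy as np


def split_indices(n, train_frac=0.8, seed=42):
    """Shuffle indices and split them 80/20, matching the ratio used above."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    cut = int(n * train_frac)
    return idx[:cut], idx[cut:]


def build_model(input_shape=(64, 64, 1), n_classes=3):
    """Hypothetical small CNN for the three gesture classes A, B, C."""
    from tensorflow import keras          # deferred import
    from tensorflow.keras import layers
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

With our 1,200 images, `split_indices(1200)` yields 960 training and 240 test examples.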
Training and validation accuracy
Training and validation loss
As we can see, the training loss gradually decreases as the number of epochs increases, and the validation loss decreases at roughly the same rate.
4. Real-Time Webcam Prediction
5. Limitations
Our model has a few limitations:
- The model works well only in good lighting conditions.
- A plain background is needed for the model to detect gestures.
6. Custom YOLOv5 Model
To build a more accurate and business-specific model, we trained a custom YOLOv5 object detection model on words/short sentences: YES, NO, REDUCE PRICE, GOODBYE, and NICE.
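Loading custom YOLOv5 weights and running inference is typically done through `torch.hub`, as sketched below. The weights filename `best.pt` and the class ordering are assumptions; the label lookup is a plain-Python helper added for illustration.

```python
# Assumed class order for the custom model; the real order comes from data.yaml.
LABELS = ["YES", "NO", "REDUCE PRICE", "GOODBYE", "NICE"]


def decode(class_id):
    """Map a predicted class index to its gesture label."""
    return LABELS[int(class_id)]


def load_and_predict(image_path, weights="best.pt"):
    """Load the custom YOLOv5 model via torch.hub and run inference on one image."""
    import torch  # deferred so decode() works without PyTorch installed
    model = torch.hub.load("ultralytics/yolov5", "custom", path=weights)
    results = model(image_path)
    results.print()                   # summary of detections
    return results.pandas().xyxy[0]   # detections as a DataFrame
```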
We would like to mention the example of ISHAARA, a restaurant in Mumbai, India, whose owners support the deaf and hard-of-hearing community by hiring hearing- and speech-impaired staff. To enable conversation, they created customized menu cards that assign a particular sign to each cuisine. In this way they provide jobs and help members of this community feel equal in society. However, this process of communication can be slow. Our idea aims to make such interactions more effective and time-saving.
Here are sample images from the custom YOLOv5 model we built to make the system more business-specific.
7. Real-Time Webcam Predictions
The images below are snapshots of real-time webcam predictions using the YOLOv5 model.
Here we showcased a business use case: a sign language recognition system for the deaf and hard of hearing that helps bridge the communication gap. Communication plays a very important role in business, and time is valuable; implementing a system like this saves time and makes communication more effective.