Learning the ASL Alphabet with Computer Vision and Flask

Ryan Belfer
7 min read · Jan 9, 2022

Skip to the 19-minute mark in the video for the actual demo!

Code for this project can be found on my GitHub.

I have been learning (and unfortunately unlearning without practice!) foreign languages for the past 15 years: from Spanish in high school, to Japanese in college, to Norwegian/Swedish just for vacation, there’s something monumentally satisfying in gaining the ability to converse with a whole new population segment and in better understanding a little piece of that culture. In the learning process, the English alphabet makes a great segue into Latin and Germanic languages, in which an English speaker can sound out the words even before understanding their meaning. Yet sign languages hold a unique distinction in having only a visual form: being able to “write” means being able to “speak”.

Growing up, I didn’t have any interest in learning sign language. I didn’t personally know any profoundly deaf people, and the pop culture I consumed didn’t prominently feature the deaf. Then in the past 5 or so years, a sudden explosion of deaf representation occurred: A Quiet Place, Dark (my favorite Netflix show!), the horror movie Hush, and most recently Hawkeye, Echo, and Makkari in the Marvel Cinematic Universe. I realized I had an opportunity to learn a new language, one whose fingerspelled alphabet maps directly onto English. American Sign Language, or ASL (there are hundreds of sign languages, even different ones for English depending on the country), can be learned by anyone, and the easiest place to start would be learning the alphabet. This also presented an opportunity to learn more about computer vision, the area of data science concerned with interpreting image data, in order to meet my goal.

The Iterative Design Process

Because this was a computer vision project, my initial thought was to use a Convolutional Neural Network, or CNN, to classify the images into letters. The “images” would actually be frames from my webcam, so that I could draw the prediction on top of the video in real time. I knew I wanted to use Python’s opencv library to get the data and then train a CNN using tensorflow. My first iteration went a little something like this:

  1. Find the resolution of the webcam, and draw an appropriate static bounding box. Do the sign in the box.
  2. Snap an image of the box and convert it to greyscale, reducing the number of color channels from 3 to 1. I could also put in a small amount of noise or rotate the image slightly to add variance.
  3. Repeat this a few hundred times per letter, then train a CNN model and post the prediction on the screen.
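A minimal sketch of that first loop with opencv-python; the box coordinates, key bindings, and output file name below are placeholders rather than the values from my code:

```python
import cv2

cap = cv2.VideoCapture(0)                                 # default webcam
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))            # e.g. 640
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))          # e.g. 480

# Static bounding box to sign inside (placeholder coordinates)
x1, y1, x2, y2 = 100, 100, 300, 300

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow("Sign inside the box", frame)

    key = cv2.waitKey(1) & 0xFF
    if key == ord("s"):                                   # snap a training image
        crop = frame[y1:y2, x1:x2]
        grey = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)     # 3 color channels -> 1
        cv2.imwrite("letter_sample.png", grey)
    elif key == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
```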

This method had drawbacks, the most obvious being that a static bounding box confines the signing to a small region of space. Additionally, people with different webcam resolutions could find this difficult, as my integrated laptop webcam only had a resolution of 640 x 480. Enter MediaPipe, a huge Python library full of goodies that allows for object detection, object tracking, image segmentation (classifying each pixel as “object” or “not object”), and, best for this project, a way of tracking hands down to the specific hand parts. So, my second iteration went something like this:

  1. Find the resolution of the webcam for resizing purposes. Track the hand and create a bounding box around the hand based on the minimum and maximum x and y coordinates returned from mediapipe (with some padding added).
  2. Snap an image of the box and resize it to an arbitrary value, like 128 x 128 pixels, then convert to greyscale.
  3. Repeat this a few hundred times per letter at different distances from the webcam and hand angles, then train a CNN model and post the prediction on screen.
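With MediaPipe, the box comes from the detected hand instead of a fixed region. A sketch of the landmark-to-bounding-box step; the padding and confidence values here are arbitrary stand-ins:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)
pad = 20                                                  # pixels of padding around the hand

with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.6) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        h, w, _ = frame.shape
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark  # 21 normalized (0-1) points
            xs = [p.x * w for p in lm]
            ys = [p.y * h for p in lm]
            x1, y1 = max(int(min(xs)) - pad, 0), max(int(min(ys)) - pad, 0)
            x2, y2 = min(int(max(xs)) + pad, w), min(int(max(ys)) + pad, h)
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            crop = frame[y1:y2, x1:x2]        # would then be resized to 128x128 and greyscaled

        cv2.imshow("Hand tracking", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break

cap.release()
cv2.destroyAllWindows()
```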

So I went through with this and got results. The CNN was able to recognize the letters well enough, but I only trained on letters in “Hello World”, and realized that even with the letters “r” and “d”, there was some confusion. With 26 letters, it would be much harder to classify. Finally, I realized that I was getting the x and y coordinates of the finger joints and palm anyway, so why not use those as my data? With the wrist point as the baseline, I could calculate distances to the different points on the hand, and that should be enough to differentiate the signs!

Using the coordinates ended up being my final design. At first I used the Euclidean distance, but ended up needing to separate the x and y distances. Here was the final process:

  1. Find the resolution of the webcam for resizing purposes. Track the hand.
  2. Snap the x and y coordinates of the different points on the hand, do some math including the resizing, and put the results with the label into a pandas dataframe (like a table in Excel).
  3. Repeat this a few hundred times per letter at different distances from the webcam and hand angles, then train a classifier model and post the prediction (sketched below). Ultimately I chose a random forest, trained on an 80/20 train/validation split stratified by letter. The test data would be the signs I made later on live webcam video, once the model was being used for prediction. The training accuracy was over 99% and the validation accuracy was over 97%.
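A sketch of how the wrist-relative features and the random forest might fit together; the feature names, n_estimators, and random_state are illustrative, not necessarily what is in my repo:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def wrist_relative_features(landmarks, width, height):
    """Scaled x/y offsets of each of the 21 landmarks from the wrist (landmark 0)."""
    wrist = landmarks[0]
    feats = {}
    for i, lm in enumerate(landmarks):
        feats[f"dx_{i}"] = (lm.x - wrist.x) * width
        feats[f"dy_{i}"] = (lm.y - wrist.y) * height
    return feats


def train_classifier(rows):
    """rows: one dict per snapshot, the feature dict above plus a 'label' key for the letter."""
    df = pd.DataFrame(rows)
    X, y = df.drop(columns="label"), df["label"]
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)  # 80/20 split, stratified by letter
    clf = RandomForestClassifier(n_estimators=200, random_state=42)
    clf.fit(X_train, y_train)
    print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
    return clf
```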

Two quirks of ASL are worth mentioning. First, the letters J and Z require movement, so I used the ending hand position for classification. Second, for double letters like in “letter” or “book”, the signer slides their hand a little to the left or right to sign the second letter, so I had to add a condition checking whether the user moved their hand.
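A minimal sketch of that movement condition, reusing the 10%-of-the-screen threshold mentioned later; the function name and the caching of the previous wrist position are my own illustration, not code from the repo:

```python
def is_double_letter_slide(prev_wrist_x, curr_wrist_x, frame_width, threshold=0.10):
    """Treat a sideways wrist shift of more than ~10% of the frame width as the
    slide that signals a repeated letter (e.g. the double t in "letter")."""
    if prev_wrist_x is None:                  # no previous frame to compare against
        return False
    return abs(curr_wrist_x - prev_wrist_x) > threshold * frame_width
```

In the main loop, the previous wrist position would be cached each frame, and a repeated letter only accepted once this check passes.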

Practicing Sign Language on a Flask App

With the model complete, I wanted to be able to share my app with the world by deploying it on Heroku (spoilers: this didn’t work, because the camera detection looks for a webcam on the host, and Heroku’s server is not the user’s computer and doesn’t even have a webcam), or at least deploying locally using Flask (spoilers: this works!). Flask is a simple Python framework, basically a way to have a server run the code on the backend and serve it to a webpage. My idea for the site was to have multiple practice modes:

  1. Easy mode: Have a word show on screen in white text, as well as a small image of the correct way to sign the next letter (the letter images were taken from Wikimedia Commons, so they are open source). The words are sourced from the english-words package, with words shorter than 3 or longer than 10 letters removed (see the filtering sketch after this list). I also removed words containing “z” because that was the most difficult letter to detect. For every correct letter the user signs, the letter turns green and then the next correct sign appears.
  2. Medium mode: Same as easy mode, but no sign image is there to help.
  3. Hard mode: Same as medium mode, but if enough incorrect letters are guessed, the word starts over.
  4. Freeplay: Sign anything you want, and the predicted letters will show. To clear the screen of letters, move your hands out of the image.
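The word filtering itself is short. A minimal sketch, assuming the english_words package’s get_english_words_set helper (its exact API differs between package versions):

```python
from english_words import get_english_words_set  # API may differ between package versions

words = get_english_words_set(["web2"], lower=True)

# Keep 3- to 10-letter alphabetic words, and drop anything containing the
# hard-to-detect letter z
practice_words = [w for w in words
                  if 3 <= len(w) <= 10 and w.isalpha() and "z" not in w]
```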

After setting up the code, I needed to create an index.html file for the frontend of the site. I added buttons to switch modes, where making a switch refreshes the word to spell. The webcam feed shows up on the page as well.

The HTML lists some instructions, but the important parts are the url_for calls: one for the buttons that let the user change the mode, and one for embedding the webcam video in the webpage.

Let’s take a look at the Python file for the Flask app. I named it sign_app.py, so when spinning up Flask, it will be necessary to set FLASK_APP=sign_app.py in the command line.

The top of the file is a bunch of global instantiations for the modes. The parts worth mentioning are the RandomForest model loaded in using joblib, and the helper functions. The camera_max() function detects the maximum index of the user’s cameras. For example, I have an integrated webcam and an external webcam, so when the external is plugged in, the max index is 1. The mediapipe_detection function detects the user’s hand using the hand model and returns the “landmarks”, a set of 21 points on the hand such as the wrist, finger segments, and palm locations. Lastly, the get_landmark_dist_test function calculates the scaled (depending on webcam resolution and distance from camera) x and y distances of each landmark from the wrist, which are the features fed to the model.
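I won’t reproduce the globals here, but this sketch shows roughly what those three helpers boil down to; the names match the article, while the bodies are my paraphrase rather than the repo’s code:

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands


def camera_max(limit=10):
    """Return the highest camera index that opens successfully (e.g. 1 when an
    external webcam is plugged in alongside the integrated one)."""
    max_idx = 0
    for i in range(limit):
        cap = cv2.VideoCapture(i)
        if cap.isOpened():
            max_idx = i
        cap.release()
    return max_idx


def mediapipe_detection(frame, hands):
    """Run the MediaPipe hand model on a BGR frame and return its 21 landmarks
    (wrist, finger joints, fingertips), or None if no hand is found."""
    results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_hand_landmarks:
        return results.multi_hand_landmarks[0].landmark
    return None


def get_landmark_dist_test(landmarks, width, height):
    """Scaled x and y distances of every landmark from the wrist, matching the
    features the random forest was trained on."""
    wrist = landmarks[0]
    row = []
    for lm in landmarks:
        row.append((lm.x - wrist.x) * width)
        row.append((lm.y - wrist.y) * height)
    return row
```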

The modes are mostly the same functionally, so I just want to walk through easy mode. The code for the other modes can be found on GitHub.

Easy mode looks for one hand at 60% confidence and puts the word to sign in white in the lower-left corner of the video, with correctly signed letters overwritten in green. The letter_help image is pulled from the folder of open-source images and superimposed on top of the video by replacing an image slice of the same size. If a hand is detected, the x and y values of the bounding box are established by finding the minimum and maximum x and y of the landmarks.
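A sketch of the on-frame drawing described above; the positions, font, and 100 x 100 helper-image size are assumptions, and the helper image is assumed to be a 3-channel BGR array:

```python
import cv2


def draw_prompt(frame, word, n_correct, letter_help=None):
    """Write the target word in the lower-left corner (correct letters in green,
    the rest in white) and paste the helper sign image into a corner of the frame."""
    h, w, _ = frame.shape
    x, y = 20, h - 20
    done, todo = word[:n_correct], word[n_correct:]

    cv2.putText(frame, done, (x, y), cv2.FONT_HERSHEY_SIMPLEX, 1.5, (0, 255, 0), 3)
    (done_width, _), _ = cv2.getTextSize(done, cv2.FONT_HERSHEY_SIMPLEX, 1.5, 3)
    cv2.putText(frame, todo, (x + done_width, y), cv2.FONT_HERSHEY_SIMPLEX, 1.5,
                (255, 255, 255), 3)

    if letter_help is not None:
        helper = cv2.resize(letter_help, (100, 100))
        frame[0:100, w - 100:w] = helper       # replace an image slice of the same size
    return frame
```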

Every third of a second, a frame is grabbed and a prediction of the letter is made. If the probability is over a certain threshold, the letter is accepted as the prediction; if it’s correct, it gets added to the user’s correct word and the helper image changes to the next letter. For double letters, an extra check is made using the location variable to see if the hand has moved more than 10% of the screen away. Once a word is complete, a new word is generated half a second later.
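The throttling can be as simple as comparing timestamps. A sketch, with the probability threshold as an assumed value and maybe_predict as an illustrative name:

```python
import time

PREDICT_EVERY = 1 / 3          # seconds between predictions
PROB_THRESHOLD = 0.6           # assumed confidence cutoff
last_prediction_time = 0.0


def maybe_predict(clf, features):
    """Predict at most every third of a second, and only return a letter when the
    random forest is confident enough about it."""
    global last_prediction_time
    now = time.time()
    if now - last_prediction_time < PREDICT_EVERY:
        return None
    last_prediction_time = now

    probs = clf.predict_proba([features])[0]
    best = probs.argmax()
    if probs[best] >= PROB_THRESHOLD:
        return clf.classes_[best]
    return None
```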

The last part of the code is for Flask:

The @app.route() lines act as decorators determining which URLs run the functions defined underneath them. For the default URL, only the HTML template is loaded. For the video feed URL, the function connects to the url_for reference in the HTML template that embeds the webcam video. And for the requests URL, a POST request comes from the user to make a change, in which case the server changes the mode and picks a new word; a GET request simply returns the HTML template.
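In outline, that Flask section looks something like the sketch below; the endpoint names are illustrative, and generate_frames() and set_mode() are placeholders standing in for the webcam generator and mode-switching logic described above:

```python
from flask import Flask, Response, render_template, request

app = Flask(__name__)


def generate_frames():
    """Placeholder for the real generator, which grabs webcam frames, annotates them
    with the prediction, JPEG-encodes them, and yields them as multipart chunks."""
    yield b""


def set_mode(mode):
    """Placeholder for the mode switch: change the global mode and pick a new word."""


@app.route("/")
def index():
    # Default URL: just load the HTML template
    return render_template("index.html")


@app.route("/video_feed")
def video_feed():
    # Referenced by url_for() in index.html; streams the annotated webcam video
    return Response(generate_frames(),
                    mimetype="multipart/x-mixed-replace; boundary=frame")


@app.route("/requests", methods=["GET", "POST"])
def requests_handler():
    # POST: the user pressed a mode button, so switch modes and pick a new word
    if request.method == "POST":
        set_mode(request.form.get("mode"))
    # GET (or after handling the POST) simply serves the template again
    return render_template("index.html")
```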

To run it with Flask, use python -m flask run on the command line and go to 127.0.0.1:5000 (or a different port if one is set). Please see the end of the video for a demo!

