Webcam Eye Tracker: Deep Learning with PyTorch
PyTorch model for eye tracking using webcam data
December 21, 2020
So far we have extracted webcam features and collected coordinate data. Now we can use that dataset to create our deep learning model with PyTorch. The following models and analyses were conducted in a Jupyter notebook, which can be found here.
The problem we have is essentially bounding box regression, simplified to just 2 continuous output values (an X-Y screen coordinate). To summarize, the data we have available to us:
- Possible inputs
- Unaligned face (3D Image)
- Aligned face (3D Image)
- Left eye (3D Image)
- Right eye (3D Image)
- Head position (2D Image)
- Head angle (Scalar)
- Outputs
- X screen coordinate
- Y screen coordinate
The goal is to find the most accurate model that can map some combination of inputs to an output X-Y coordinate pair. We will be experimenting with a few different models to find the best fit.
We will be using Mean Squared Error (MSE) as our loss function. When it comes to the "real world" accuracy of our model, we will take the square root of that (RMSE). This can be interpreted as the pixel-wise distance between our predicted location and the true location. For example, an MSE loss value of 10,000 would be equivalent to 100 pixels of inaccuracy in the prediction.
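As a quick sanity check on that conversion, using the example value from above:

```python
import math

mse_loss = 10_000                 # example MSE value from the text
rmse = math.sqrt(mse_loss)        # root mean squared error
print(f"{rmse:.0f} px of error")  # -> 100 px, the pixel-wise distance from the true location
```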
Dataset overview
The dataset contains 25,738 examples, with 69.01% of screen locations being sampled at least once. The entire dataset is 319MB in size.
The first thing we can do is check how that dataset is distributed across the screen:
In the bottom-left and top-right, you can see 2D and 3D plots of the region map we created, which show the number of data samples at each region of the screen.
In the top-left and bottom-right are histograms showing the number of samples within each section of the screen. As suspected, the center of the screen contains the most data samples, with the edges being relatively undersampled. This may end up reducing prediction accuracy near the edges and corners.
For those extreme screen regions, we can check to make sure there is good variation in input features. For example, we can see there is a difference in the way the eyes look when they are gazing at the 9 calibration locations:
Ingesting data
We first need a way to get our data into our models. For that we can use a PyTorch Dataset and DataLoader. These allow us to define how data samples are retrieved from disk, and they handle preprocessing, shuffling, and batching of the data. The benefit is that we don't need to load the entire dataset into memory; data batches are loaded as needed.
PyTorch Dataset
For the Dataset, we define where the data is stored in the __init__ method. The __getitem__ method then defines what should happen when our DataLoader requests a sample. In this case it simply uses PIL to load the image and applies a few image transformations:
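A minimal sketch of such a Dataset, assuming each sample is stored as an image file plus an X-Y label (the real class, file layout, and transforms are in the linked notebook):

```python
import torch
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image

class FaceDataset(Dataset):
    """Loads a face image and its X-Y screen coordinate label from disk."""

    def __init__(self, samples, img_size=(64, 64)):
        # `samples` is assumed to be a list of (image_path, x, y) tuples
        self.samples = samples
        self.transform = transforms.Compose([
            transforms.Resize(img_size),
            transforms.ToTensor(),   # HWC uint8 -> CHW float in [0, 1]
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        img_path, x, y = self.samples[idx]
        image = self.transform(Image.open(img_path).convert("RGB"))
        target = torch.tensor([x, y], dtype=torch.float32)
        return image, target
```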
PyTorch DataLoader
The DataLoader handles the task of actually fetching a batch of data and passing it to our PyTorch models. Here you can control things like the batch size and whether the data should be shuffled. I created a function that splits the entire dataset into train/validation/test sets and creates a DataLoader for each:
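A sketch of that helper; the split fractions, batch size, and worker count here are illustrative defaults, and the actual function in the notebook may differ:

```python
from torch.utils.data import DataLoader, random_split

def make_dataloaders(dataset, batch_size=32, val_frac=0.15, test_frac=0.15):
    """Split a Dataset into train/validation/test DataLoaders."""
    n_total = len(dataset)
    n_val = int(n_total * val_frac)
    n_test = int(n_total * test_frac)
    n_train = n_total - n_val - n_test

    train_ds, val_ds, test_ds = random_split(dataset, [n_train, n_val, n_test])

    train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True, num_workers=4)
    val_dl = DataLoader(val_ds, batch_size=batch_size, num_workers=4)
    test_dl = DataLoader(test_ds, batch_size=batch_size, num_workers=4)
    return train_dl, val_dl, test_dl
```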
Face model
We’ll start by creating a simple model using only the unaligned face image. We can use PyTorch Lightning for this as it helps to streamline the code and remove a lot of boilerplate.
PyTorch Lightning
First, we create a LightningModule, which is where we define the layers of the model (__init__) and what happens during a single forward pass. It receives a config object containing a number of hyperparameters that we will be tuning (more on this in the next section):
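A condensed sketch of what that module might look like, assuming a 64×64 RGB face image and hypothetical config keys (num_kernels, dense_nodes, dropout); the full class is linked further down:

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class FaceModel(pl.LightningModule):
    """Sketch of the single-input face model; layer sizes are illustrative."""

    def __init__(self, config):
        super().__init__()
        self.save_hyperparameters(config)                      # logs hyperparameters for reference/Tensorboard
        self.example_input_array = torch.zeros(1, 3, 64, 64)   # blank input tensor for graph logging

        k = config["num_kernels"]
        self.conv_in = nn.Conv2d(3, k, kernel_size=3, padding=1)
        # one "convolution block": convolution + ReLU + batch norm + max pool
        self.block = nn.Sequential(
            nn.Conv2d(k, k, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(k),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(k * 32 * 32, config["dense_nodes"]),
            nn.ReLU(),
            nn.Dropout(config["dropout"]),
            nn.Linear(config["dense_nodes"], 2),               # X-Y screen coordinate
        )

    def forward(self, x):
        x = self.conv_in(x)
        x = self.block(x)
        return self.fc(x)
```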
The unaligned face image is first passed through a convolution layer. That's followed by a number of "convolution blocks", each consisting of convolution, ReLU activation, batch normalization, and max pooling. The output from these blocks is then reshaped and passed into some fully connected layers with dropout.
A few things to note:
- Calling save_hyperparameters() is not required, but it allows us to log the hyperparameters being used in this model for reference and for Tensorboard
- The self.example_input_array attribute is a blank input tensor, and is required for logging the graph to Tensorboard
The class also needs methods that define what happens during each step of training/validation/test. You can find the full class here.
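A hedged sketch of what those step methods might look like, sitting inside the module above and reusing its imports (the real versions are in the linked class):

```python
# These methods would sit inside the LightningModule sketched above.
def training_step(self, batch, batch_idx):
    images, targets = batch
    loss = nn.functional.mse_loss(self(images), targets)
    self.log("train_loss", loss)
    return loss

def validation_step(self, batch, batch_idx):
    images, targets = batch
    loss = nn.functional.mse_loss(self(images), targets)
    self.log("val_loss", loss)

def configure_optimizers(self):
    # assumes "learning_rate" is one of the tuned hyperparameters in config
    return torch.optim.Adam(self.parameters(), lr=self.hparams["learning_rate"])
```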
Next, we have a function that creates our dataset, instantiates our model, creates a trainer, and fits the model:
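A sketch of that training function, reusing the helpers above; load_samples() is a placeholder for whatever builds the list of examples, and a single GPU is assumed:

```python
def train_model(config, max_epochs=10):
    # load_samples() stands in for the real code that lists (image_path, x, y) examples
    dataset = FaceDataset(load_samples())
    train_dl, val_dl, test_dl = make_dataloaders(dataset, batch_size=config["batch_size"])

    model = FaceModel(config)
    # When tuning, a callback that reports val_loss back to Ray Tune would also be added here
    trainer = pl.Trainer(max_epochs=max_epochs, gpus=1)
    trainer.fit(model, train_dl, val_dl)
    return trainer
```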
Ray Tune
Finally, we need to wrap the training function in some Ray Tune code that allows us to do hyperparameter tuning. Ray Tune provides an extremely simple way to do (distributed) hyperparameter tuning. One of the best things about Ray Tune is that it offers an algorithm called ASHA.
Traditionally, when you tune hyperparameters using grid search or random search, you fully train all of your model/hyperparameter combinations. This can be a waste of resources, because you can often tell early on that some models just won't work well. ASHA is a successive halving algorithm that prunes poorly performing models and only fully trains the best ones.
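A hedged sketch of that wrapper, assuming the training function above reports val_loss back to Tune after each validation epoch (for example via Ray's PyTorch Lightning callback); the exact resources and sample counts are illustrative:

```python
from ray import tune
from ray.tune.schedulers import ASHAScheduler

def tune_model(config, num_samples=30, max_epochs=10):
    """Run a hyperparameter search over `config`, pruning weak trials with ASHA."""
    scheduler = ASHAScheduler(
        max_t=max_epochs,        # longest any single trial is allowed to run
        grace_period=1,          # minimum epochs before a trial can be stopped
        reduction_factor=2,      # keep roughly the best half at each rung
    )
    analysis = tune.run(
        tune.with_parameters(train_model, max_epochs=max_epochs),
        config=config,
        metric="val_loss",
        mode="min",
        num_samples=num_samples,
        scheduler=scheduler,
        resources_per_trial={"cpu": 2, "gpu": 1},
    )
    return analysis.best_config
```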
Training the unaligned face model
With all the helper functions defined, training the model is as simple as providing a range of hyperparameter values in a config dictionary and calling our tune function. PyTorch also allows us to log training results to Tensorboard for analysis.
We start by exploring a wide range of values to get a sense of what the search space looks like:
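The wide initial search space might look something like this (the keys and ranges are illustrative, matching the sketches above):

```python
from ray import tune

config = {
    "batch_size": tune.choice([16, 32, 64, 128]),
    "learning_rate": tune.loguniform(1e-5, 1e-2),
    "num_kernels": tune.choice([16, 32, 64]),
    "dense_nodes": tune.choice([64, 128, 256, 512]),
    "dropout": tune.uniform(0.1, 0.5),
}

best_config = tune_model(config)
```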
If we look at the train and validation loss graphs, we can see ASHA in action, pruning poorly performing models to save time:
We can then check hyperparameter performance in Tensorboard to get a sense of how we can narrow the ranges:
For example, we can see that the best performing models (coloured blue) tend to have small batch sizes, a learning rate around 1×10⁻⁴, and a larger number of fully connected (dense) nodes. We can use this information to fine-tune the hyperparameter ranges and search again over more epochs.
After a second round of search, we take the best performing hyperparameters and train the final model over 50 epochs:
On the test set we get an MSE loss of 2362, which is a pixel error of around 48.6 pixels.
We can use the same functions to compare the face model to one using the aligned face images instead. You can find the details of this process in the Jupyter notebook. The aligned face model gives a larger MSE loss of 2539, and a pixel error of 50.4 pixels.
The performance with aligned faces is slightly worse. It’s possible that head angle is an important feature for eye tracking, and is being learned indirectly from the unaligned face image through multiple convolutions.
Eye model
For an eye tracker, it’s a good idea to check a model where we only input the eye images.
This is slightly more complicated as we have 2 input images, but we just need to add a second network of convolutions, and merge the results from the left and right eye image convolutions before going into the fully connected layers. You can find the full model definition here.
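A hedged sketch of the two-branch idea, assuming 32×32 eye crops and the same hypothetical config keys as before (the actual model definition is linked above):

```python
import torch
import torch.nn as nn
import pytorch_lightning as pl

class EyeModel(pl.LightningModule):
    """Sketch of the two-eye model: one conv branch per eye, merged before the dense layers."""

    def __init__(self, config):
        super().__init__()
        self.save_hyperparameters(config)
        k = config["num_kernels"]

        def eye_branch():
            return nn.Sequential(
                nn.Conv2d(3, k, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.BatchNorm2d(k),
                nn.MaxPool2d(2),
                nn.Flatten(),
            )

        self.left_branch = eye_branch()
        self.right_branch = eye_branch()
        # assumes 3x32x32 eye crops -> k * 16 * 16 features per branch after one pool
        self.fc = nn.Sequential(
            nn.Linear(2 * k * 16 * 16, config["dense_nodes"]),
            nn.ReLU(),
            nn.Linear(config["dense_nodes"], 2),
        )
        # training/validation steps and the optimizer follow the same pattern as the face model

    def forward(self, left_eye, right_eye):
        merged = torch.cat([self.left_branch(left_eye), self.right_branch(right_eye)], dim=1)
        return self.fc(merged)
```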
After going through the same process of exploring initial hyperparameter ranges, fine tuning the values, and fully training the best model of 50 epochs, we get an MSE loss of 3837, and a pixel error of 61.9 pixels.
Using eye image inputs appears to result in a model that is significantly worse than using the face image alone. Presumably this is because the face image already contains both eyes, and the network is able to isolate those regions through successive convolutions.
Multiple input model
So far, our best performing model uses a single unaligned face image. Next, we need to test more complex models that use combinations of these different images. I suspect that a full face image provides most of the information needed, but passing the eyes separately would help the model focus on those regions specifically. We can also pass in head angle and head position, which would allow us to keep the face network relatively shallow.
The plan is to use the unaligned face, left and right eye, head position, and head angle as inputs into the same model. Each of these features will be passed into a "sub" network of the model. We will also allow each one to take on different hyperparameter values (e.g., filter size, layer depth etc.).
Unknown sizes
This is the point in my PyTorch learning where I hit my only real difficulty. Each image input will pass through a different number of convolution/pool layers, using different numbers of kernels, with different layer sizes. If we plan on tuning each of these as separate parameters, then we cannot determine ahead of time what the final size post-convolution will be. Hence, we won't know how many input nodes the fully connected layer needs.
You will need to manually track and calculate the feature map sizes for each image after an undetermined number of convolution/pooling operations. This proved to be a bit of a headache, as PyTorch requires layer dimensions to be defined in the __init__ method of the class. Doing this manually took over 100 lines of code. If someone knows of an automatic way to do this then please let me know!
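For reference, the bookkeeping boils down to repeatedly applying the standard output-size formula for convolutions and pooling; a small illustrative helper (not the code from the notebook) looks like this:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1

# e.g. a 64x64 image through a 7x7 convolution (no padding) and a 2x2 max pool:
num_kernels = 32                                   # kernels in the last conv layer (illustrative)
h = conv_output_size(64, kernel_size=7)            # -> 58
h = conv_output_size(h, kernel_size=2, stride=2)   # -> 29
flat_features = num_kernels * h * h                # input size of the first fully connected layer
```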
The final model
After going through the hyperparameter exploration/tuning/finalization process, the best performing model had the following architecture:
Architecture:
- Unaligned face:
  - Convolution
  - Block: (convolution + ReLU + batch norm + max pool)
  - Block: (convolution + ReLU + batch norm + max pool)
- Left eye:
  - Convolution
  - Block: (convolution + ReLU + batch norm + max pool)
- Right eye:
  - Convolution
  - Block: (convolution + ReLU + batch norm + max pool)
- Head position:
  - Convolution
- The outputs of each of the sub-networks above are merged with head angle
- Everything is then flattened -> 128-node fully connected layer -> 64-node fully connected layer
Sizes can be seen below:
The model appears to perform best when the face image is passed through 2 full convolution blocks with a large 7×7 filter size, while head position (being only a 2D image) requires just a single convolution layer with a small 3×3 filter.
When we pass the test set through this model, we get an MSE loss of 2037, and a pixel error of 45.1 pixels. This is the best performing model so far.
Errors over screen space
The pixel error values we’ve been looking at are averaged over the entire screen. This is useful for comparing models, but not so useful for determining if there are certain screen locations where our model performs poorly.
What we can do is plot the prediction error at each coordinate of the screen to see if there are any patterns. The functions for this are predict_screen_error and plot_screen_errors, which produce a plot of errors over screen space.
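Those functions are specific to this project, but the general idea can be sketched as follows: bin each test sample by its true screen location, average the prediction error per bin, and plot the result as a heatmap (the screen size, bin count, and plotting details here are assumptions, not the actual implementation):

```python
import numpy as np
import matplotlib.pyplot as plt
import torch

def plot_screen_errors_sketch(model, dataloader, screen_w=1920, screen_h=1080, bins=50):
    """Illustrative error map: mean pixel error binned by true screen location."""
    totals = np.zeros((bins, bins))
    counts = np.zeros((bins, bins))

    model.eval()
    with torch.no_grad():
        for images, targets in dataloader:
            preds = model(images)
            dists = torch.norm(preds - targets, dim=1)       # pixel error per sample
            for (x, y), d in zip(targets.tolist(), dists.tolist()):
                i = min(int(y / screen_h * bins), bins - 1)  # row index (vertical)
                j = min(int(x / screen_w * bins), bins - 1)  # column index (horizontal)
                totals[i, j] += d
                counts[i, j] += 1

    mean_error = totals / np.maximum(counts, 1)              # avoid division by zero
    plt.imshow(mean_error, extent=[0, screen_w, screen_h, 0])
    plt.colorbar(label="mean pixel error")
    plt.title("Prediction error over screen space")
    plt.show()
```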
In the case of the full model with multiple inputs, the error map looks like this:
What we can see, as suspected based on the original distribution of collected data, is that the screen edges and corners tend to have the highest prediction errors.
We can try to improve this by collecting more data and prioritizing the screen edges using the flags in config.ini. When we double the dataset to 50k samples and retrain the same model, we end up with a pixel error of 48.4px.
The average pixel error is slightly worse than our previous attempt, but if we look at the error map we can see that the errors are smaller and more evenly spread across the entire screen:
Of course, there are more models we can test and we can find other ways to improve our prediction accuracy. But at this point we can try deploying it into a test application to see how well it works. See you in the next post.