Creating a Gym Training Aid App Using Pose Estimation Techniques
Building an app that leverages pose estimation algorithms can significantly enhance workout routines. In this post, I walk through my journey of developing a straightforward gym training aid app that uses computer vision and machine learning to assess and refine exercise form.
The Challenge
When I began working out at home, I quickly realized how hard it is to maintain correct form without the guidance of a personal trainer. Mirrors offered some assistance, but they were insufficient. I wanted a way to analyze my movements and get feedback on my technique. Given my affinity for metrics, I also wanted to quantify my progress and monitor improvements over time, even if real-time feedback was out of reach at first.
High-Level Overview
My objective was to create an application capable of reviewing exercise videos and offering insights on my form. Here’s a brief outline of my strategy:
- Implement a keypoint detection model to evaluate exercise videos.
- Contrast my movements with those of a professional.
- Develop a metric that indicates whether my exercise execution is correct and identifies areas for enhancement.
Requirements
As I explored potential solutions, I established key requirements. I needed an easy-to-implement solution that would run efficiently on my MacBook Pro M1, allowing for rapid experimentation without the expense of high-end GPUs. My aim was to innovate and refine my approach without being overwhelmed by technical specifications or costly hardware. Thus, I began delving into pose estimation algorithms and the realm of computer vision.
Introduction to Pose Estimation
Pose estimation is a well-researched domain with applications in various fields, including action recognition, activity tracking, augmented reality, gaming, and animation. The primary aim is to identify the location and orientation of a person's body parts, like joints and limbs, within an image or video.
There are two primary categories of pose estimation: single-person and multi-person. Single-person pose estimation focuses on identifying the pose of an individual in an image, making it a regression problem. Conversely, multi-person pose estimation tackles the more complex challenge of detecting multiple individuals and their positions within an image.
Single-person pose estimation can be subdivided into direct regression-based and heatmap-based frameworks. The former predicts keypoints directly from a feature map, while the latter generates heatmaps for all keypoints and employs further methods to produce the final representation.
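As a toy illustration of the heatmap-based idea (not any particular model's actual decoding logic), extracting a keypoint from its heatmap can be as simple as locating the highest-scoring cell:

```python
import numpy as np

def decode_heatmap(heatmap):
    """Return the (row, col) of the highest-scoring cell in a
    single-keypoint heatmap, together with its score."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (row, col), heatmap[row, col]
```

Real heatmap-based models add refinement steps on top of this coarse argmax, which is where the offset heads discussed below come in.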
Finding an Effective Keypoint Detection Model
As I immersed myself in pose estimation, I encountered numerous keypoint detection models. Among the top contenders, OmniPose showcased remarkable accuracy. However, I was particularly drawn to OpenMMLab’s Pose Estimation Toolbox, which offers a robust framework for all things related to pose estimation, including a model comparison benchmark.
Seeking a straightforward and lightweight solution, I opted for Google’s MoveNet. This compact and efficient pose estimation model is designed for mobile and embedded devices, featuring approximately 4 million parameters compared to OmniPose's 68 million. MoveNet's simplicity made it an ideal choice for my project, facilitating rapid prototyping without demanding significant computational resources. While it may not match the accuracy of more complex models, it served as a solid starting point.
MoveNet Functionality
So, how does MoveNet operate? Essentially, it utilizes heatmaps to pinpoint human keypoints accurately. As a bottom-up estimation model, it first identifies human joints and then constructs the pose from these joints.
The MoveNet architecture consists of two main elements:
- Feature Extractor: A MobileNetV2 coupled with a Feature Pyramid Network. MobileNetV2 is a lightweight convolutional neural network ideal for mobile and embedded applications. The Feature Pyramid Network enables MoveNet to capture features at various scales, crucial for detecting keypoints at different distances from the camera.
- Predictor Heads: A series of predictor heads linked to the feature extractor, responsible for predicting:
- The geometric center of the individual
- An initial regressed estimate of the full set of keypoints for that person
- Heatmaps locating the positions of all keypoints
- Local offsets from each output feature map pixel to the precise sub-pixel location of each keypoint
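To make the offset head concrete: the keypoint heatmap yields a coarse, low-resolution cell, and the local offset nudges it to a sub-pixel position in the input image. A toy sketch of that final step, where the stride value is an illustrative assumption rather than MoveNet's actual configuration:

```python
def refine_keypoint(cell_y, cell_x, offset_y, offset_x, stride=4):
    """Map a coarse heatmap cell plus its predicted local offset to a
    sub-pixel keypoint location in input-image coordinates. The stride
    (output-cell size in input pixels) is an illustrative assumption."""
    return cell_y * stride + offset_y, cell_x * stride + offset_x
```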
MoveNet is accessible on TensorFlow Hub, along with an extensive array of tutorials, documentation, and associated code, ensuring a smooth introduction to the model. Impressively, MoveNet can operate in a browser, achieving over 30 frames per second on most modern devices, including smartphones. This capability makes it particularly suitable for fitness, health, and wellness applications, where prompt feedback and low latency are essential.
Extracting Keypoints
MoveNet identifies 17 keypoints spanning from the nose to the ankles, outputting a 17x3 tensor. Each row contains the keypoint's normalized Y and X coordinates and a confidence score.
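Loading the model from TensorFlow Hub and extracting keypoints from a single frame looks roughly like the sketch below, using the Lightning variant; the exact variant and preprocessing in my notebook may differ:

```python
import tensorflow as tf
import tensorflow_hub as hub

# MoveNet Lightning: the smaller, faster single-pose variant.
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

def extract_keypoints(frame):
    """Run MoveNet on one RGB frame (H x W x 3 uint8 array) and
    return a (17, 3) array of (y, x, confidence) rows."""
    image = tf.expand_dims(frame, axis=0)
    image = tf.image.resize_with_pad(image, 192, 192)  # Lightning input size
    outputs = movenet(tf.cast(image, tf.int32))
    return outputs["output_0"][0, 0].numpy()  # shape [17, 3]
```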
I qualitatively assessed the keypoint detection results from my recordings and was satisfied with the model's accuracy, as it could effectively identify keypoints given adequate lighting and clear angles. The confidence scores provided insight into the reliability of the detections, allowing me to disregard any low-confidence keypoints.
Overall, I was impressed with MoveNet's performance as a tool for extracting keypoints from my recordings.
Transitioning from Frames to Sequences — Aligning Recordings
While extracting keypoints from single frames is vital, it is insufficient for practical applications. It's essential to consider that recordings may not align perfectly. Comparing keypoints frame by frame without alignment would yield incorrect results. If one recording starts even slightly earlier than another, the keypoints will misalign, despite identical movements. To make the scores meaningful, I needed to synchronize the frames of each recording.
I performed most of the alignment manually using video editing software, trimming and adjusting recordings for synchronization. To enhance the alignment, I applied Dynamic Time Warping (DTW), a method that compares sequences of varying lengths or timings, refining the alignment to ensure accurate keypoint matching.
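As a sketch of that refinement step, a DTW implementation such as the fastdtw package (one possible choice; my notebook may use a different one) can produce a frame-to-frame alignment path between two keypoint sequences:

```python
import numpy as np
from fastdtw import fastdtw  # pip install fastdtw
from scipy.spatial.distance import euclidean

def align_sequences(seq_a, seq_b):
    """Align two keypoint sequences of shape (num_frames, 17, 2)
    with DTW and return the list of matched frame-index pairs."""
    # Flatten each frame into a 34-dim pose vector so DTW compares
    # whole-body poses rather than individual keypoints.
    a = seq_a.reshape(len(seq_a), -1)
    b = seq_b.reshape(len(seq_b), -1)
    _, path = fastdtw(a, b, dist=euclidean)
    return path  # e.g. [(0, 0), (1, 1), (2, 1), ...]
```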
Manual alignment with DTW refinement sufficed for my simple use case. However, this labor-intensive method isn’t scalable for real-world applications, necessitating the automation of the alignment process. Developing algorithms that can synchronize recordings accurately amidst noise and variability presents another challenge worth exploring in a separate article.
Comparing Movements
With the sequences aligned, it was time to compare my movements against a professional’s. I employed cosine similarity, a prevalent metric in the pose estimation domain.
Cosine similarity quantifies the similarity between two vectors by calculating the cosine of the angle between them. In pose estimation, it is often used to compare two sets of keypoints (e.g., body joints or facial landmarks). The metric is favored because it is insensitive to the overall scale of the vectors being compared, making it apt for comparing poses of subjects who appear at different sizes in the frame.
I experimented with several variants of this metric and documented my findings. I recorded myself twice: first executing the exercise as accurately as I could (shown on the left in the GIF), and second deliberately performing it incorrectly (on the right), with a noticeable forward lean of my back during the movement. The professional's execution is in the center (the reference).
Simple Cosine Similarity
The most straightforward approach I devised was to analyze the entire movement simultaneously. I concatenated all keypoints into a vector with the shape [num_frames * 17 (num_keypoints) * 2 (coordinates)] and calculated the cosine similarity between my movements and the professional's. The results were as follows:
- cos_sim(correct_movement, professional) = 0.8409
- cos_sim(incorrect_movement, professional) = 0.8255
It was evident that the second movement was less similar to the reference, though the difference (0.0154) was minimal.
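In code, this whole-movement comparison amounts to something like the following sketch:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened keypoint vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def movement_similarity(seq_a, seq_b):
    """Compare two whole movements of shape (num_frames, 17, 2) by
    flattening each into one long vector and taking the cosine."""
    return cosine_similarity(seq_a.reshape(-1), seq_b.reshape(-1))
```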
Frame-by-Frame and Averaging
My next approach capitalized on the alignment of frames. I computed the cosine similarity of keypoints on corresponding frames (during similar movement phases) and averaged the results.
From the chart, it was clear that the movement on the right was less similar to the reference (and hence worse): its per-frame score dropped to 0.79 and stayed consistently below the left's.
The mean scores were nearly identical to those from the first approach:
- mean cos_sim(correct_movement, professional) = 0.8411
- mean cos_sim(incorrect_movement, professional) = 0.8256
- median cos_sim(correct_movement, professional) = 0.8399
- median cos_sim(incorrect_movement, professional) = 0.8257
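A sketch of the frame-by-frame variant, reusing the cosine_similarity helper defined above:

```python
import numpy as np

def per_frame_similarity(seq_a, seq_b):
    """Cosine similarity between corresponding (aligned) frames,
    summarized by mean and median. Sequences: (num_frames, 17, 2)."""
    sims = [cosine_similarity(a.reshape(-1), b.reshape(-1))
            for a, b in zip(seq_a, seq_b)]
    return np.mean(sims), np.median(sims)
```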
Weighted Similarity
I had yet to leverage the third score returned by MoveNet — the keypoint confidence score.
Certain keypoints (like the left elbow) were barely discernible in the reference recording, and the same applied to my videos, as I attempted to capture from a similar angle.
I incorporated confidence scores as weights in the computation of weighted cosine similarity, ensuring that clearly visible keypoints received greater emphasis. This methodology yielded the following scores:
- mean cos_sim(correct_movement, professional) = 0.8135
- mean cos_sim(incorrect_movement, professional) = 0.7976
The results reaffirmed that the second movement was inferior to the first, though the difference was slight. For practical applications, further refinement of the metric would be necessary.
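For completeness, one way to fold the confidences into the metric (a sketch; the exact weighting scheme in my notebook may differ, and the helper names are illustrative) is a weighted inner product, where each keypoint's weight combines the confidences from both recordings:

```python
import numpy as np

def weighted_cosine_similarity(u, v, w):
    """Cosine similarity with per-dimension weights w >= 0."""
    num = np.sum(w * u * v)
    den = np.sqrt(np.sum(w * u * u)) * np.sqrt(np.sum(w * v * v))
    return float(num / den)

def frame_similarity(kp_a, kp_b):
    """Weighted similarity for one aligned frame pair, where each
    keypoint row is (y, x, confidence) as returned by MoveNet."""
    # One plausible choice: weight by the product of both recordings'
    # confidences, repeated so y and x share the same weight.
    w = np.repeat(kp_a[:, 2] * kp_b[:, 2], 2)
    return weighted_cosine_similarity(kp_a[:, :2].reshape(-1),
                                      kp_b[:, :2].reshape(-1), w)
```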
Future Enhancements
Reflecting on my project, two primary areas for enhancement stand out:
Advancing Core CV/AI Technology
From a technical perspective, numerous avenues exist to refine the pose estimation algorithm. For example, optimizing and calibrating comparison metrics could enhance the accuracy of exercise form assessments. Another strategy might involve analyzing bones or entire limbs instead of solely focusing on joints, providing a more comprehensive understanding of movement. Additionally, ensuring algorithm resilience to variations in camera angles, lighting, and other environmental influences would enhance reliability.
Production and User Experience
The second improvement area pertains to production readiness. To deliver a seamless user experience, I would need to automate the entire process, necessitating more time for data preprocessing and alignment. This endeavor would require streamlining the workflow, managing potential technical hurdles, and crafting an intuitive interface. Furthermore, compiling a diverse library of exercises across various settings, including different camera angles and environments, would be vital in offering users a wide array of options and scenarios for practice.
The Jupyter Notebook that facilitated the creation of this post is available here.
Further Reading
- Google | MoveNet | Kaggle
- Pose estimation, tracking, and comparison
- MoveNet: Ultra fast and accurate pose detection model | TensorFlow Hub
- TensorFlow’s New Model MoveNet Explained | by Sam Hannan | Medium