Creating a Gym Training Aid App Using Pose Estimation Techniques
Building an app that leverages pose estimation algorithms can significantly enhance workout routines. In this post, I walk through my journey of developing a straightforward gym training aid app that uses computer vision and machine learning to assess and refine exercise form.
The Challenge
When I began working out at home, I quickly realized how hard it is to maintain correct form without the guidance of a personal trainer. Mirrors offered some assistance, but they were insufficient. I wanted a way to analyze my movements and get feedback on my technique. Given my affinity for metrics, I also wanted to quantify my progress and monitor improvements over time, even if real-time feedback was out of reach at first.
High-Level Overview
My objective was to create an application capable of reviewing exercise videos and offering insights on my form. Here’s a brief outline of my strategy:
- Implement a keypoint detection model to evaluate exercise videos.
- Contrast my movements with those of a professional.
- Develop a metric that indicates whether my exercise execution is correct and identifies areas for enhancement.
Requirements
As I explored potential solutions, I established key requirements. I needed an easy-to-implement solution that would run efficiently on my MacBook Pro M1, allowing for rapid experimentation without the expense of high-end GPUs. My aim was to innovate and refine my approach without being overwhelmed by technical specifications or costly hardware. Thus, I began delving into pose estimation algorithms and the realm of computer vision.
Introduction to Pose Estimation
Pose estimation is a well-researched domain with applications in various fields, including action recognition, activity tracking, augmented reality, gaming, and animation. The primary aim is to identify the location and orientation of a person's body parts, like joints and limbs, within an image or video.
There are two primary categories of pose estimation: single-person and multi-person. Single-person pose estimation focuses on identifying the pose of an individual in an image, making it a regression problem. Conversely, multi-person pose estimation tackles the more complex challenge of detecting multiple individuals and their positions within an image.
Single-person pose estimation can be subdivided into direct regression-based and heatmap-based frameworks. The former predicts keypoints directly from a feature map, while the latter generates heatmaps for all keypoints and employs further methods to produce the final representation.
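As a toy illustration of the heatmap-based idea (not any particular model's actual decoding logic), extracting a keypoint from its heatmap can be as simple as locating the highest-scoring cell:

```python
import numpy as np

def decode_heatmap(heatmap):
    """Return the (row, col) of the highest-scoring cell in a
    single-keypoint heatmap, together with its score."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (row, col), heatmap[row, col]
```

Real heatmap-based models add refinement steps on top of this coarse argmax, which is where the offset heads discussed below come in.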
Finding an Effective Keypoint Detection Model
As I immersed myself in pose estimation, I encountered numerous keypoint detection models. Among the top contenders, OmniPose showcased remarkable accuracy. However, I was particularly drawn to OpenMMLab’s Pose Estimation Toolbox, which offers a robust framework for all things related to pose estimation, including a model comparison benchmark.
Seeking a straightforward and lightweight solution, I opted for Google’s MoveNet. This compact and efficient pose estimation model is designed for mobile and embedded devices, featuring approximately 4 million parameters compared to OmniPose's 68 million. MoveNet's simplicity made it an ideal choice for my project, facilitating rapid prototyping without demanding significant computational resources. While it may not match the accuracy of more complex models, it served as a solid starting point.
MoveNet Functionality
So, how does MoveNet operate? Essentially, it utilizes heatmaps to pinpoint human keypoints accurately. As a bottom-up estimation model, it first identifies human joints and then constructs the pose from these joints.
The MoveNet architecture consists of two main elements:
- Feature Extractor: A MobileNetV2 coupled with a Feature Pyramid Network. MobileNetV2 is a lightweight convolutional neural network ideal for mobile and embedded applications. The Feature Pyramid Network enables MoveNet to capture features at various scales, crucial for detecting keypoints at different distances from the camera.
- Predictor Heads: A series of predictor heads linked to the feature extractor, responsible for predicting:
- The geometric center of the individual
- An initial regressed estimate of the full set of keypoints for that person
- Heatmaps locating the positions of all keypoints
- Local offsets from each output feature map pixel to the precise sub-pixel location of each keypoint
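To make the offset head concrete: the keypoint heatmap yields a coarse, low-resolution cell, and the local offset nudges it to a sub-pixel position in the input image. A toy sketch of that final step, where the stride value is an illustrative assumption rather than MoveNet's actual configuration:

```python
def refine_keypoint(cell_y, cell_x, offset_y, offset_x, stride=4):
    """Map a coarse heatmap cell plus its predicted local offset to a
    sub-pixel keypoint location in input-image coordinates. The stride
    (output-cell size in input pixels) is an illustrative assumption."""
    return cell_y * stride + offset_y, cell_x * stride + offset_x
```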
MoveNet is accessible on TensorFlow Hub, along with an extensive array of tutorials, documentation, and associated code, ensuring a smooth introduction to the model. Impressively, MoveNet can operate in a browser, achieving over 30 frames per second on most modern devices, including smartphones. This capability makes it particularly suitable for fitness, health, and wellness applications, where prompt feedback and low latency are essential.
Extracting Keypoints
MoveNet identifies 17 keypoints spanning from the nose to the ankles, outputting a 17x3 tensor. Each row contains the keypoint's normalized Y and X coordinates and a confidence score.
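Loading the model from TensorFlow Hub and extracting keypoints from a single frame looks roughly like the sketch below, using the Lightning variant; the exact variant and preprocessing in my notebook may differ:

```python
import tensorflow as tf
import tensorflow_hub as hub

# MoveNet Lightning: the smaller, faster single-pose variant.
model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

def extract_keypoints(frame):
    """Run MoveNet on one RGB frame (H x W x 3 uint8 array) and
    return a (17, 3) array of (y, x, confidence) rows."""
    image = tf.expand_dims(frame, axis=0)
    image = tf.image.resize_with_pad(image, 192, 192)  # Lightning input size
    outputs = movenet(tf.cast(image, tf.int32))
    return outputs["output_0"][0, 0].numpy()  # shape [17, 3]
```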
I qualitatively assessed the keypoint detection results from my recordings and was satisfied with the model's accuracy, as it could effectively identify keypoints given adequate lighting and clear angles. The confidence scores provided insight into the reliability of the detections, allowing me to disregard any low-confidence keypoints.
Overall, I was impressed with MoveNet's performance as a tool for extracting keypoints from my recordings.
Transitioning from Frames to Sequences — Aligning Recordings
While extracting keypoints from single frames is vital, it is insufficient for practical applications. It's essential to consider that recordings may not align perfectly. Comparing keypoints frame by frame without alignment would yield incorrect results. If one recording starts even slightly earlier than another, the keypoints will misalign, despite identical movements. To make the scores meaningful, I needed to synchronize the frames of each recording.
I performed most of the alignment manually using video editing software, trimming and adjusting recordings for synchronization. To enhance the alignment, I applied Dynamic Time Warping (DTW), a method that compares sequences of varying lengths or timings, refining the alignment to ensure accurate keypoint matching.
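As a sketch of that refinement step, a DTW implementation such as the fastdtw package (one possible choice; my notebook may use a different one) can produce a frame-to-frame alignment path between two keypoint sequences:

```python
import numpy as np
from fastdtw import fastdtw  # pip install fastdtw
from scipy.spatial.distance import euclidean

def align_sequences(seq_a, seq_b):
    """Align two keypoint sequences of shape (num_frames, 17, 2)
    with DTW and return the list of matched frame-index pairs."""
    # Flatten each frame into a 34-dim pose vector so DTW compares
    # whole-body poses rather than individual keypoints.
    a = seq_a.reshape(len(seq_a), -1)
    b = seq_b.reshape(len(seq_b), -1)
    _, path = fastdtw(a, b, dist=euclidean)
    return path  # e.g. [(0, 0), (1, 1), (2, 1), ...]
```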
Manual alignment with DTW refinement sufficed for my simple use case. However, this labor-intensive method isn’t scalable for real-world applications, necessitating the automation of the alignment process. Developing algorithms that can synchronize recordings accurately amidst noise and variability presents another challenge worth exploring in a separate article.
Comparing Movements
With the sequences aligned, it was time to compare my movements against a professional’s. I employed cosine similarity, a prevalent metric in the pose estimation domain.
Cosine similarity quantifies the similarity between two vectors by calculating the cosine of the angle between them. In pose estimation, it is often used to compare two sets of keypoints (e.g., body joints or facial landmarks). The metric is favored because it is insensitive to the overall scale of the vectors being compared, making it apt for comparing poses of subjects who appear at different sizes in the frame.
I experimented with several variants of this metric and documented my findings. I recorded myself twice: first executing the exercise as accurately as I could (shown on the left in the GIF), and second deliberately performing it incorrectly (on the right), with a noticeable forward lean of my back during the movement. The professional's execution is in the center (the reference).
Simple Cosine Similarity
The most straightforward approach I devised was to analyze the entire movement simultaneously. I concatenated all keypoints into a vector with the shape [num_frames * 17 (num_keypoints) * 2 (coordinates)] and calculated the cosine similarity between my movements and the professional's. The results were as follows:
- cos_sim(correct_movement, professional) = 0.8409
- cos_sim(incorrect_movement, professional) = 0.8255
It was evident that the second movement was less similar to the reference, though the difference (0.0154) was minimal.
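In code, this whole-movement comparison amounts to something like the following sketch:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two flattened keypoint vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def movement_similarity(seq_a, seq_b):
    """Compare two whole movements of shape (num_frames, 17, 2) by
    flattening each into one long vector and taking the cosine."""
    return cosine_similarity(seq_a.reshape(-1), seq_b.reshape(-1))
```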
Frame-by-Frame and Averaging
My next approach capitalized on the alignment of frames. I computed the cosine similarity of keypoints on corresponding frames (during similar movement phases) and averaged the results.
From the chart, it was clear that the movement on the right was less similar to the reference (and hence worse): its per-frame score dropped to 0.79 and stayed consistently below the left's.
The mean scores were nearly identical to those from the first approach:
- mean cos_sim(correct_movement, professional) = 0.8411
- mean cos_sim(incorrect_movement, professional) = 0.8256
- median cos_sim(correct_movement, professional) = 0.8399
- median cos_sim(incorrect_movement, professional) = 0.8257
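A sketch of the frame-by-frame variant, reusing the cosine_similarity helper defined above:

```python
import numpy as np

def per_frame_similarity(seq_a, seq_b):
    """Cosine similarity between corresponding (aligned) frames,
    summarized by mean and median. Sequences: (num_frames, 17, 2)."""
    sims = [cosine_similarity(a.reshape(-1), b.reshape(-1))
            for a, b in zip(seq_a, seq_b)]
    return np.mean(sims), np.median(sims)
```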
Weighted Similarity
I had yet to leverage the third score returned by MoveNet — the keypoint confidence score.
Certain keypoints (like the left elbow) were barely discernible in the reference recording, and the same applied to my videos, as I attempted to capture from a similar angle.
I incorporated confidence scores as weights in the computation of weighted cosine similarity, ensuring that clearly visible keypoints received greater emphasis. This methodology yielded the following scores:
- mean cos_sim(correct_movement, professional) = 0.8135
- mean cos_sim(incorrect_movement, professional) = 0.7976
The results reaffirmed that the second movement was inferior to the first, though the difference was slight. For practical applications, further refinement of the metric would be necessary.
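For completeness, one way to fold the confidences into the metric (a sketch; the exact weighting scheme in my notebook may differ, and the helper names are illustrative) is a weighted inner product, where each keypoint's weight combines the confidences from both recordings:

```python
import numpy as np

def weighted_cosine_similarity(u, v, w):
    """Cosine similarity with per-dimension weights w >= 0."""
    num = np.sum(w * u * v)
    den = np.sqrt(np.sum(w * u * u)) * np.sqrt(np.sum(w * v * v))
    return float(num / den)

def frame_similarity(kp_a, kp_b):
    """Weighted similarity for one aligned frame pair, where each
    keypoint row is (y, x, confidence) as returned by MoveNet."""
    # One plausible choice: weight by the product of both recordings'
    # confidences, repeated so y and x share the same weight.
    w = np.repeat(kp_a[:, 2] * kp_b[:, 2], 2)
    return weighted_cosine_similarity(kp_a[:, :2].reshape(-1),
                                      kp_b[:, :2].reshape(-1), w)
```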
Future Enhancements
Reflecting on my project, two primary areas for enhancement stand out:
Advancing Core CV/AI Technology
From a technical perspective, numerous avenues exist to refine the pose estimation algorithm. For example, optimizing and calibrating comparison metrics could enhance the accuracy of exercise form assessments. Another strategy might involve analyzing bones or entire limbs instead of solely focusing on joints, providing a more comprehensive understanding of movement. Additionally, ensuring algorithm resilience to variations in camera angles, lighting, and other environmental influences would enhance reliability.
Production and User Experience
The second improvement area pertains to production readiness. To deliver a seamless user experience, I would need to automate the entire process, necessitating more time for data preprocessing and alignment. This endeavor would require streamlining the workflow, managing potential technical hurdles, and crafting an intuitive interface. Furthermore, compiling a diverse library of exercises across various settings, including different camera angles and environments, would be vital in offering users a wide array of options and scenarios for practice.
The Jupyter Notebook that facilitated the creation of this post is available here.
Further Reading
- Google | MoveNet | Kaggle
- Pose estimation, tracking, and comparison
- MoveNet: Ultra fast and accurate pose detection model | TensorFlow Hub
- TensorFlow’s New Model MoveNet Explained | by Sam Hannan | Medium