I’m creating a dance game in the browser that uses TensorFlow.js (specifically the MoveNet pose-estimation model) to analyze a person’s movements and compare them to the moves of the song they’re dancing to.
In the previous blog posts, I outlined a general plan and talked about how to use YouTube videos with TensorFlow.js. Now that we’ve got the video, we’ll need to compare each frame of it with the webcam stream from the user, all in real time. This way, the user can see how well they’re doing at any given moment as they play the song.
How do we compare the poses and dance moves between one person and another? How do we account for different body shapes and sizes?
The Plan
When you analyze an image (or frame of a video in my case), TensorFlow.js returns some data that looks a little like this:
"keypoints": [
{
"y": 95.41931572589485,
"x": 289.713457280619,
"score": 0.8507946133613586,
"name": "nose"
},
{
"y": 87.39720528471378,
"x": 299.0246599912063,
"score": 0.8859434723854065,
"name": "left_eye"
},
{
"y": 89.00106838638418,
"x": 279.21988732828237,
"score": 0.7947761416435242,
"name": "right_eye"
},
... (and more, 17 keypoints total)
Each keypoint has an x and y position (where the keypoint is on the screen), score (how confident TFJS is that this keypoint is correct), and name (label for the keypoint).
Here is a diagram of all the keypoints on a human model (indices are simply the order of the keypoints returned):
(More detailed info here about the keypoint diagram)
This is all the information we get from TensorFlow.js, and we need to somehow use this data to fit our needs. We are going to get two sets of this type of data: one for the dance video that we need to match, and one for our live webcam feed.
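For reference, pulling that keypoint data out of MoveNet looks roughly like this. This is a minimal sketch assuming the @tensorflow-models/pose-detection package with its defaults, not the exact setup used in the game:

```js
import * as poseDetection from '@tensorflow-models/pose-detection';
import '@tensorflow/tfjs-backend-webgl';

// Create the MoveNet detector once, then reuse it for every frame.
const detector = await poseDetection.createDetector(
  poseDetection.SupportedModels.MoveNet
);

// `video` can be the dance <video> element or the webcam stream.
async function getKeypoints(video) {
  const poses = await detector.estimatePoses(video);
  return poses[0]?.keypoints ?? []; // the 17-keypoint array shown above
}
```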
We need to give the player a score to tell them how well they’re doing using this data. How can we take raw 2D positional data and turn it into something useful? And after we turn it into something useful, how can we determine how well a person is performing the correct dance move?
Initial Thoughts
These were my initial, unsorted thoughts:
Base the keypoint data positions on a center, average position in the middle of the chest. This way, when the person moves around the frame, the keypoints move with them, so relative to that chest center they stay still. By applying the same thing to the live keypoint data, both data sets will be in a somewhat normalized space.
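Here’s a rough sketch of that centering step. I’m assuming the “chest center” is the average of the shoulder and hip keypoints; that’s a choice I made for the sketch, not something MoveNet gives you:

```js
// Re-base every keypoint on the average of the shoulders and hips,
// so positions no longer depend on where the person is in the frame.
function centerOnChest(keypoints) {
  const chestNames = ['left_shoulder', 'right_shoulder', 'left_hip', 'right_hip'];
  const chestPoints = keypoints.filter((kp) => chestNames.includes(kp.name));
  const chestX = chestPoints.reduce((sum, kp) => sum + kp.x, 0) / chestPoints.length;
  const chestY = chestPoints.reduce((sum, kp) => sum + kp.y, 0) / chestPoints.length;

  return keypoints.map((kp) => ({ ...kp, x: kp.x - chestX, y: kp.y - chestY }));
}
```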
Next up is the problem of seeing how well the keypoint data sets match.
A person might be taller or shorter or have a different body size or limb proportions than the dancer in the video, so how do we scale/transform them to match? It needs to be a connection/limb-based scaling/transformation, because simply scaling someone down on the y axis won’t always work. Someone might have a long torso and short arms, or a short torso and long arms. These need to be taken into account, so we need to transform the distances between each pair of connected keypoints.
We will need to get measurements of a person before they begin. We’ll have them do a T-pose and record the measurements of each limb.
But how can we get the measurements of the dancer that they are following in the video? That dancer isn’t going to T-pose for you.
During the analysis of the dance with TFJS, we could also record the maximum length of each limb/connection. We use the maximum instead of an average because a person can’t stretch past their maximum limb length - that’s just their limb length.
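In code, that could look something like this: run it on every analyzed frame and keep a running maximum per connection. The connection list and key format here are just my assumptions for the sketch:

```js
// A few of MoveNet's keypoint connections; the real list covers the whole skeleton.
const CONNECTIONS = [
  ['left_shoulder', 'left_elbow'],
  ['left_elbow', 'left_wrist'],
  ['right_shoulder', 'right_elbow'],
  ['right_elbow', 'right_wrist'],
  ['left_hip', 'left_knee'],
  ['left_knee', 'left_ankle'],
  ['right_hip', 'right_knee'],
  ['right_knee', 'right_ankle'],
];

// Update the running maximum length of each connection with one frame's keypoints.
function updateMaxLimbLengths(keypoints, maxLengths = {}) {
  const byName = Object.fromEntries(keypoints.map((kp) => [kp.name, kp]));
  for (const [a, b] of CONNECTIONS) {
    const length = Math.hypot(byName[a].x - byName[b].x, byName[a].y - byName[b].y);
    const key = `${a}-${b}`;
    maxLengths[key] = Math.max(maxLengths[key] ?? 0, length);
  }
  return maxLengths;
}
```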
Now that we have corresponding limb lengths of both dancers, how do we transform one to “fit” the other?
We need to scale each limb along its axis, taking all other connected points with it.
For example, if one dancer’s shoulders are farther apart than the dancer we are comparing to, we need to shift those shoulders closer together. Shifting these shoulders closer together will also cause the arms to shift in closer, because otherwise we would have really long arms. And shifting the arms is shifting multiple, connected keypoints.
The General Plan
First, record the dance video keypoint data:
- Run the video through MoveNet and record all keypoint data at each frame in the video.
- Run this data through a filter to make each keypoint position based on the average chest position at that point.
- Convert keypoint positions and limb lengths from pixel values to a unit that isn’t tied to the video’s resolution. We can take the body length (torso length + leg length) and divide everything by it to get all measurements relative to the body length. For example, the shoulder-to-elbow length might be 0.2 BLU, or body-length units. The torso itself might be closer to 0.4 BLU.
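A sketch of that conversion, assuming the body length is measured down one side of the body (torso plus upper and lower leg); the exact definition is still up in the air:

```js
// Convert pixel-space keypoints to body-length units (BLU) by dividing
// everything by the body length (torso + leg) measured in pixels.
function toBLU(keypoints) {
  const byName = Object.fromEntries(keypoints.map((kp) => [kp.name, kp]));
  const dist = (a, b) =>
    Math.hypot(byName[a].x - byName[b].x, byName[a].y - byName[b].y);

  const torso = dist('left_shoulder', 'left_hip');
  const leg = dist('left_hip', 'left_knee') + dist('left_knee', 'left_ankle');
  const bodyLength = torso + leg; // the BLU reference, in pixels

  // Limb lengths get divided by the same reference value.
  return keypoints.map((kp) => ({ ...kp, x: kp.x / bodyLength, y: kp.y / bodyLength }));
}
```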
Now we can take the live video and transform its keypoint data to the expected dance video keypoint data:
1. Get the player’s measurements by having them make a T-pose and running it through MoveNet. Get the measurements in BLU.
2. Run the live video through MoveNet and get the keypoint data for the current frame.
3. Run this data through a filter to make each keypoint position based on the average chest position at that point.
4. Convert keypoint positions and limb lengths from pixels to BLU.
5. Transform player BLU keypoints and limb lengths to dancer BLU keypoints and limb lengths.
6. Compare the distances of player vs. dancer BLU keypoint positions to see how well the player is performing the dance.
Transforming the data in step 5 will be the difficult part. In BLU, every body part is relative to the body length, so we need to match up the body length first, then match up each limb length.
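I don’t have this figured out yet, but one way it could work is to walk the skeleton outward from the torso, rescaling each limb to the dancer’s length and dragging the downstream keypoints along. The tree structure and function names below are my own sketch, not a settled design (and it only handles the limbs; the shoulder/hip spacing from the earlier example would need the same treatment):

```js
// Which joint hangs off which, walking outward from the torso.
const CHILDREN = {
  left_shoulder: ['left_elbow'],
  left_elbow: ['left_wrist'],
  right_shoulder: ['right_elbow'],
  right_elbow: ['right_wrist'],
  left_hip: ['left_knee'],
  left_knee: ['left_ankle'],
  right_hip: ['right_knee'],
  right_knee: ['right_ankle'],
};

// Rescale the player's limbs (already in BLU) to the dancer's limb lengths,
// keyed as "parent-child" like the max-length tracking above.
function fitPlayerToDancer(playerKeypoints, dancerLimbLengths) {
  const byName = Object.fromEntries(
    playerKeypoints.map((kp) => [kp.name, { ...kp }])
  );

  // Shift a joint and everything attached further down the chain.
  const shiftSubtree = (name, dx, dy) => {
    byName[name].x += dx;
    byName[name].y += dy;
    (CHILDREN[name] ?? []).forEach((child) => shiftSubtree(child, dx, dy));
  };

  // Give the limb the dancer's length while keeping the player's direction.
  const rescale = (parent, child) => {
    const p = byName[parent];
    const c = byName[child];
    const dx = c.x - p.x;
    const dy = c.y - p.y;
    const playerLength = Math.hypot(dx, dy) || 1;
    const scale = (dancerLimbLengths[`${parent}-${child}`] ?? playerLength) / playerLength;
    shiftSubtree(child, dx * scale - dx, dy * scale - dy);
    (CHILDREN[child] ?? []).forEach((grandchild) => rescale(child, grandchild));
  };

  ['left_shoulder', 'right_shoulder', 'left_hip', 'right_hip'].forEach((root) =>
    (CHILDREN[root] ?? []).forEach((child) => rescale(root, child))
  );

  return Object.values(byName);
}
```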
Another issue that might come up, though, is if the dancer in the video moves closer to or farther from the camera. This might mess up BLU measurements if BLU only uses the absolute maximum limb lengths, rather than the limb lengths at the current point in time. This can probably be solved by detecting whether the dancer is moving closer to or farther from the camera and then scaling the limb lengths based on that, which will affect the BLU measurements.
How do we detect the approximate distance of a person from the camera, though? We could potentially use the side lengths of the abdomen, since those won’t change much, even when the person is spinning or rotating. Those would only change if the person were lying on the ground and not facing the camera. Or we could take the BLU reference unit (total body length in pixels) and divide it by the height of the video. It would still be skewed if the person rotated in a way that made them appear to have a shorter abdomen or legs, but it could work.
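The second idea could look something like this; it’s only a rough proximity signal, not a real depth estimate:

```js
// Estimate how close the person is to the camera by comparing their
// body length in pixels to the height of the video frame.
function estimateProximity(keypoints, videoHeight) {
  const byName = Object.fromEntries(keypoints.map((kp) => [kp.name, kp]));
  const dist = (a, b) =>
    Math.hypot(byName[a].x - byName[b].x, byName[a].y - byName[b].y);

  const bodyLengthPx =
    dist('left_shoulder', 'left_hip') +
    dist('left_hip', 'left_knee') +
    dist('left_knee', 'left_ankle');

  // A larger fraction of the frame means the person is closer to the camera.
  return bodyLengthPx / videoHeight;
}
```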
Also, some dance videos zoom in/out. This must be taken into account somehow as well.
Scoring After Transforming
After applying the above transformation methods to make the keypoints as similar as possible, we need to figure out a scoring method to determine how similar the two data sets are.
We could use some sort of 2D distance formula combined with a threshold. Say a distance of 5 units (I say units here because the measurements are currently arbitrary) is the maximum distance someone can be from the expected keypoint. That would be a score of 0, and a distance of 0 would be a score of 1. Anything in between would be on a sliding scale, but what kind of sliding scale? Linear, quadratic, cubic, or something different? It could be good to have a quadratic scale so it’s easy to match at the start, but it gets more difficult as you get closer to matching it. Or, on the flip side, it could get easier as you get closer. This would also help to account for errors within TensorFlow.js as well as stuttering or other issues.
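As a starting point, the threshold-plus-sliding-scale idea could be sketched like this. The 5-unit maximum and the quadratic curve are placeholders, and matching keypoints by index assumes both arrays are in MoveNet’s fixed keypoint order:

```js
const MAX_DISTANCE = 5; // arbitrary units, as described above

// Score a single keypoint: 1 when it's exactly on target, 0 at MAX_DISTANCE or beyond.
function scoreKeypoint(playerKp, dancerKp) {
  const distance = Math.hypot(playerKp.x - dancerKp.x, playerKp.y - dancerKp.y);
  const closeness = Math.max(0, 1 - distance / MAX_DISTANCE);
  // One reading of the quadratic option; the exact curve (and its direction)
  // is still an open question.
  return closeness ** 2;
}

// Average the per-keypoint scores to get an overall score for the frame.
function scoreFrame(playerKeypoints, dancerKeypoints) {
  const total = playerKeypoints.reduce(
    (sum, kp, i) => sum + scoreKeypoint(kp, dancerKeypoints[i]),
    0
  );
  return total / playerKeypoints.length;
}
```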