Final Report GSOC2021

Introduction of the project

The subject of this project is “Gesture temporal detection pipeline for news videos”. Gesture temporal detection here means detecting human hand gestures in a video. For now, the objective is to locate gestures in time: for each frame of a video, we predict whether it belongs to a hand gesture made by a person in that frame. Because non-verbal communication plays an important role in daily life, a detector like this can be useful for researchers, especially linguists, since extracting gestures by hand from large volumes of video is laborious. With this kind of detector, researchers can extract gestures automatically and then analyse them further. It could also help researchers understand the correlation between hand gestures and body movement.
The code of this project can be found on my GitHub.
The project link:

State of the project

After two months of work, I finished the first version of this project. In this version, an input video has to satisfy the following conditions:

  1. The video should be around 10 seconds long, or around 300 frames, because I trained the model on the ChaLearn hand gesture dataset, whose training samples are all short clips. Furthermore, the model is based on the Transformer: on a long video it would compute attention between keypoints from unrelated timestamps, which slows the system down. A long video therefore has to be split into groups of frames before predictions are generated; this procedure will be added later.
  2. The video needs to have an acceptable resolution. The project uses OpenPose to extract keypoints from the video, and if the resolution is too low, keypoint extraction fails.
  3. The video can have only one person in a scene, because I haven’t yet added a model to decompose multi-person scenes into single-person keypoints. This part could be added easily.
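
The splitting of long videos mentioned in condition 1 is not implemented yet. A minimal sketch of how it could look is given below. The function name `split_into_windows`, the default window of 300 frames and the optional overlap via `stride` are assumptions for illustration, not part of the project code.

```python
from typing import List, Tuple


def split_into_windows(
    num_frames: int, window: int = 300, stride: int = 300
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges that cover the whole video.

    window: maximum number of frames per group (300 matches the clip
            length the model was trained on).
    stride: distance between group starts; a stride smaller than the
            window makes consecutive groups overlap (hypothetical option).
    """
    if num_frames < 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames must be >= 0, window and stride > 0")

    windows: List[Tuple[int, int]] = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            # The last group already reaches the end of the video.
            break
        start += stride
    return windows
```

Each group could then be passed through the detector separately, and the per-group predictions shifted by `start` to get frame indices in the full video.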

Pipeline for gesture temporal detection

Our pipeline is composed of three parts.

A video processing script

The video processing script is located in the project repository. In this part, I created some tools for converting between videos and frames, a function for visualising keypoints, and a frame selection function. The main job of video processing is to generate the final frames of the video, which OpenPose later uses to extract keypoints.

A Singularity container equipped with OpenPose

The definition file of this container can be found in openpose.def, and you can use it to build a Singularity container that contains OpenPose directly. Once you have the container, you can log in as the user ‘yxz2560’ (my account name on the HPC) and mount a home directory with the following structure:

    ├───video
    ├───frames
    └───keypoints

Normally, I put a video in the video folder and generate its frames in the frames folder; the frames with keypoints and the keypoints’ JSON file are then generated in the keypoints folder. These three folders also contain some sub-folders, which are created automatically by my scripts.
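
As an illustration of this layout, here is a minimal sketch that creates the three folders under a given root. The helper `make_workspace` and the choice of one sub-folder per video name are assumptions; the actual sub-folder naming used by the project scripts is not described here.

```python
from pathlib import Path
from typing import Dict, Union


def make_workspace(root: Union[str, Path], video_name: str) -> Dict[str, Path]:
    """Create the video/frames/keypoints layout and return the paths.

    Assumption: frames and keypoints each get one sub-folder per video.
    """
    root = Path(root)
    layout = {
        "video": root / "video",
        "frames": root / "frames" / video_name,
        "keypoints": root / "keypoints" / video_name,
    }
    for path in layout.values():
        # exist_ok makes the call safe to repeat for the same video.
        path.mkdir(parents=True, exist_ok=True)
    return layout
```

Calling `make_workspace("/path/to/home", "clip01")` twice is harmless, since existing folders are left untouched.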

Then, inside the container, you can run the keypoint extraction to generate frames with keypoints drawn on them for visualisation, and a JSON file which contains the necessary keypoint information.
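
To give an idea of what the keypoint data looks like, the sketch below reads one frame in the standard per-frame JSON layout that OpenPose writes with `--write_json`: a `people` list whose entries hold a flat `pose_keypoints_2d` array of `x, y, confidence` triples. It assumes that standard layout; the JSON file produced by this project's scripts may be organised differently, and `parse_pose_keypoints` is a hypothetical helper.

```python
import json
from typing import List, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)


def parse_pose_keypoints(frame_json: str) -> List[Keypoint]:
    """Return the body keypoints of the first person in one frame.

    An empty list is returned when OpenPose detected nobody, which is
    what tends to happen on very low-resolution frames.
    """
    data = json.loads(frame_json)
    people = data.get("people", [])
    if not people:
        return []

    # Single-person assumption from the conditions above: use person 0.
    flat = people[0].get("pose_keypoints_2d", [])
    if len(flat) % 3 != 0:
        raise ValueError("keypoint array length is not a multiple of 3")

    return [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]
```

For example, `parse_pose_keypoints(Path("frame_000000000000_keypoints.json").read_text())` would return a list of triples for that frame (the file name here is only an example).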

A gesture detector based on DETR

DETR is a model for object detection, and I wrote a post introducing it. Both the model and the data loader are modified to fit our problem. The source code of the model can be found here.

The model is trained on the Continuous Gesture Recognition dataset. Training ran for 500 epochs on a Tesla V100 and took 7 hours.

I also created a container with an Anaconda environment in it. The definition file can be found here: You can follow my post on the first week’s GSoC report to create the same container.

To make a prediction, we need the JSON file of keypoints generated by OpenPose. These keypoints are the input of our model. The model makes 10 predictions for each video; we set the number of predictions to 10 because a short clip normally contains only two or three gestures. You can then run the prediction step, which generates the final results as labeled frames and a labeled video.
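
The step from the 10 predictions to labeled frames can be sketched as follows. This is only an illustration under assumptions: each prediction is taken to be a `(center, length, score)` triple normalised to the clip length, by analogy with DETR's normalised boxes, and predictions below a score threshold are treated as "no gesture". The real output format of the model, the name `segments_to_frame_labels` and the threshold value are not taken from the project code.

```python
from typing import List, Sequence, Tuple

# (center, length, gesture score); center and length are fractions of the clip.
Segment = Tuple[float, float, float]


def segments_to_frame_labels(
    segments: Sequence[Segment], num_frames: int, score_threshold: float = 0.5
) -> List[int]:
    """Mark every frame covered by a confident predicted segment with 1."""
    labels = [0] * num_frames
    for center, length, score in segments:
        if score < score_threshold:
            # Low-score slots play the role of DETR's "no object" class.
            continue
        start = max(0, int(round((center - length / 2) * num_frames)))
        end = min(num_frames, int(round((center + length / 2) * num_frames)))
        for i in range(start, end):
            labels[i] = 1
    return labels
```

With 10 prediction slots and only two or three gestures per clip, most slots are expected to fall below the threshold and leave the labels unchanged.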


In this part, I am going to show you some results from the different steps of my pipeline.

  1. Output of the selected frames in a video
  2. Output of OpenPose on that video
  3. Output of the gesture detector


During two months of work, I built a very simple pipeline for “Gesture temporal detection pipeline for news videos”, but there is still quite a lot of work to be done. Firstly, I need to make the model work on long videos with multiple persons. Then I need to finish the statistical evaluation of the results with some metrics based on Jaccard distance. The precision of my model also needs to be improved.
This work is just the start of my journey in Red Hen Lab. Thanks to Red Hen Lab and my mentors for giving me the opportunity to work on this interesting project, and for your patience and guidance during these two months. This will help me a lot in my further work on this project.
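
The planned evaluation is not written yet. As a rough sketch of what a Jaccard-based metric on frame labels could look like, the functions below compare predicted and ground-truth per-frame labels (1 for gesture, 0 otherwise). The function names and the per-frame binary formulation are assumptions, not the final evaluation protocol.

```python
from typing import Sequence


def frame_jaccard_index(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| over the sets of gesture frames."""
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of frames")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    # Two empty sets are identical, so they count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def frame_jaccard_distance(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Jaccard distance, defined as 1 minus the Jaccard index."""
    return 1.0 - frame_jaccard_index(pred, truth)
```

A distance of 0 means the predicted gesture frames match the annotation exactly, and 1 means they do not overlap at all.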

To be continued!!!