[GSoC2020] Final report

Here is the final report summarizing the entire GSoC 2020 project with Red Hen Lab. During this three-month project, I was mainly guided by Prof. Jungseock Joo. With his help, we delivered a fruitful project. The highlight is the publication of a paper at the BEEU workshop of ECCV 2020, on which I am the first author.

I. Project definition

Although we discussed several possible directions for this project, we chose two of them as the main tasks I would work on:

  • Analysis between body movement and emotion
  • Audiovisual emotion recognition

II. Body movement and emotion

The topic is derived from the BEEU workshop:

First International Workshop on Bodily Expressed Emotion Understanding

The idea is to integrate body movement into the analysis of emotion based on a televised-debate dataset. This is in fact an extension of previous work by Prof. Joo and his students, Christina Indudhara and Kaushik Mahorker, who made several contributions:

  • Collected a dataset of the 2020 presidential debates
  • Trained an emotion recognition model and applied it to analyze the emotional state of the candidates during the debates
  • Extracted audio features from the debate dataset and performed a preliminary analysis
  • Measured the correlation between emotional state and the audio features

The previous work only considered facial expressions as visual cues. The idea of our BEEU paper is to also take body movement into consideration.

II.1 Definition of body movement

To measure the body movement of the candidates in the debates, we define delta(x) as the displacement of a key point between consecutive frames. First, we use OpenPose to extract key points such as the nose, neck, and shoulders. Then we center the key points with respect to the neck. Last, we compute the displacement delta(x) and normalize it by the candidate's nose-to-neck distance. These steps make delta(x) comparable across debates and candidates.
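
A minimal sketch of this computation is given below, assuming a per-frame key-point array produced by OpenPose; the key-point indices and the exact normalization details are assumptions and may differ from the actual code.

```python
import numpy as np

# Hypothetical indices following the OpenPose BODY_25 layout (assumption).
NOSE, NECK = 0, 1

def normalized_displacement(keypoints):
    """Compute the normalized displacement delta(x) of each key point.

    keypoints: array of shape (T, K, 2) with (x, y) coordinates of K key
    points extracted by OpenPose for T consecutive frames.
    Returns an array of shape (T - 1, K): one delta(x) per key point and
    per pair of consecutive frames.
    """
    # Center every frame on the neck so that framing differences between
    # videos do not contaminate the measure.
    centered = keypoints - keypoints[:, NECK:NECK + 1, :]

    # Displacement of each key point between consecutive frames.
    delta = np.linalg.norm(centered[1:] - centered[:-1], axis=-1)

    # Normalize by the nose-to-neck distance of the candidate so that
    # delta(x) is comparable across debates and candidates.
    scale = np.linalg.norm(centered[1:, NOSE, :], axis=-1, keepdims=True)
    return delta / np.maximum(scale, 1e-6)
```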

As shown in the following figure, hand movement clearly differs from candidate to candidate. We therefore expect a larger delta(x) for candidates like Bernie Sanders and a smaller delta(x) for candidates like Michael Bloomberg.

The first row shows the overlaid skeletons of each candidate; the deeper the color, the more stable the skeleton. The second row shows two example frames for each candidate.

II.2 Correlation between body movement and emotion

After validating delta(x) as a measure of body movement, we investigate the correlation between the emotion score and the delta(x) of each key point. Considering body symmetry, we average the left and right sides; for instance, the left and right shoulders are merged into a single shoulder category.

We obtain the following correlation table:

Correlation coefficients of emotion and body movement

The correlation coefficients are ordered from highest to lowest and colored from red to blue. The table suggests that candidates showing anger tend to have larger body movements, whereas candidates showing sadness tend to move less.
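
The sketch below illustrates how such a correlation table can be computed; the column names and placeholder data are hypothetical, and the actual analysis code lives in the repository linked in Section II.5.

```python
import numpy as np
import pandas as pd

emotions = ["anger", "happiness", "sadness", "surprise"]  # hypothetical labels
keypoints = ["wrist", "elbow", "shoulder", "nose"]        # merged categories

# Placeholder data: one row per video clip, with per-emotion scores and the
# average delta(x) of each (left/right-merged) key-point category.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((500, len(emotions) + len(keypoints))),
                  columns=emotions + keypoints)

# Pearson correlation between every emotion score and every key point's
# movement, sorted from highest to lowest as in the table above.
corr = df.corr().loc[emotions, keypoints]
print(corr.stack().sort_values(ascending=False))
```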

II.3 Clustering on the feature vectors of body movement

We use the feature vectors of delta(x) to cluster different video clips with K-means; a sketch of this step is given below. A feature vector stacks the delta(x) of the key points over 15 consecutive frames. We set K to 100 in order to separate meaningful clusters from noise. The idea is to discover meaningful patterns of body movement automatically; meaningful clusters can then be used to build a list of communicative gestures. In the previous work, gesture annotation depended largely on human perception, so only common real-life gestures could be considered. In contrast, a noisy cluster is one that does not correspond to any particular gesture.
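
Below is a minimal sketch of the clustering step, assuming a per-frame, per-key-point matrix of delta(x) values; the placeholder data and the scikit-learn call are illustrative, not the exact pipeline used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_feature_vectors(delta, window=15):
    """Stack delta(x) over `window` consecutive frames into one vector.

    delta: array of shape (T, K) of per-frame, per-key-point displacements.
    Returns an array of shape (T - window + 1, window * K).
    """
    T, K = delta.shape
    return np.stack([delta[t:t + window].ravel()
                     for t in range(T - window + 1)])

# Placeholder displacements for one clip; in practice the feature vectors
# from all clips are gathered into a single matrix.
features = build_feature_vectors(np.random.rand(200, 8))

# K = 100 so that meaningful gesture clusters are separated from noise.
kmeans = KMeans(n_clusters=100, random_state=0, n_init=10).fit(features)
labels = kmeans.labels_
```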

Here we display some meaningful clusters that we have identified.

Examples of identified meaningful clusters

II.4 Conclusion

As for future directions, first, we would like to explore the clustering more extensively and use the valid patterns to build a list of communicative gestures. This list can then be used for automatic gesture annotation on large-scale datasets. Second, in our current pipeline, multimodal cues are not jointly considered; we would like to explore the potential of fusing gestures, facial expressions, and audio features for emotion recognition. Last, we will extend our work to larger datasets and verify the generalizability of our method.

II.5 Supplementary materials

The submitted paper can be found here:

The code for data processing and analysis can be found on my GitHub:

https://github.com/kangzhiq/GSoC_Debate

We have prepared a video presenting our paper. The link to the video will be added later.

III. Audiovisual emotion recognition

The fusion of audio and visual cues for emotion recognition is a recent trend in emotion analysis and has been shown to be effective. Our intention is first to understand the state of the art and the most popular methods used for audiovisual emotion recognition, and then to propose a new method capable of outperforming the SOTA on a specific dataset: IEMOCAP.

III.1 IEMOCAP data set

We decided to use this dataset because it contains both audio and visual information about the subjects. The visual information includes raw videos of the subjects while they are speaking, as well as motion-capture data covering facial muscles and hand movements. There are two types of recordings: improvised and scripted.

Example of a recording. The motion-capture markers are visible on the speaker on the left.

III.2 Literature

It is important to understand what the SOTA is and which methods are commonly used for audiovisual emotion recognition. Thus, I prepared a table summarizing previous work on this topic.

In this table, we can find various techniques for this task, including LSTMs, 3D CNNs, attention mechanisms, etc.
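
To make the recurring ingredients concrete, here is a minimal, illustrative late-fusion model with one LSTM per modality; the feature dimensions and the four-class setup are assumptions, and this is not any specific paper's architecture.

```python
import torch
import torch.nn as nn

class LateFusionEmotionNet(nn.Module):
    """Toy audiovisual emotion classifier: one LSTM per modality, followed
    by a simple concatenation-based fusion. Dimensions are assumptions."""

    def __init__(self, audio_dim=40, visual_dim=512, hidden=128, n_classes=4):
        super().__init__()
        self.audio_lstm = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.visual_lstm = nn.LSTM(visual_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, audio, visual):
        # audio: (B, T_a, audio_dim); visual: (B, T_v, visual_dim)
        _, (h_a, _) = self.audio_lstm(audio)
        _, (h_v, _) = self.visual_lstm(visual)
        # Concatenate the final hidden state of each modality and classify.
        fused = torch.cat([h_a[-1], h_v[-1]], dim=-1)
        return self.classifier(fused)

# Example with random features and 4 emotion classes, as commonly used on IEMOCAP.
model = LateFusionEmotionNet()
logits = model(torch.randn(2, 100, 40), torch.randn(2, 50, 512))
```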

Before that, I also spent time doing background research on general emotion recognition tasks, as described in the following posts:

The list of posts written during this project is as follows:

III.3 Contribution

Unfortunately, since this task is quite challenging, I did not manage to propose a novel method to improve emotion recognition accuracy during the GSoC project. After discussing with my mentor Prof. Joo, I will continue working with him on audiovisual emotion recognition after the project ends. Our goal is a technical paper presenting a novel idea for improving audiovisual emotion recognition accuracy.

For the time being, I have the following directions to work on:

  • Read more papers to gain a global view of the different methodologies
  • Analyze where current methods fail, and propose a solution for those specific cases
  • Prepare a presentation on audiovisual emotion recognition and share it with Prof. Joo's students; the discussion would surely trigger new ideas

IV. Conclusion of this GSoC project

I am very happy that I was accepted in the first place and that Prof. Joo patiently guided me throughout the project. The paper is the most valuable gift this GSoC project has given me: it was the first time I prepared, wrote, and submitted a paper. This experience is precious and has laid a foundation for my future research plans.

I would like to express my deepest gratitude to my mentor and the Red Hen Lab community. It was a great pleasure to meet you this summer!