[GSoC2020] Overview of Audio-Visual emotion recognition


Survey on audiovisual emotion recognition: databases, features, and data fusion strategies

[Note] The survey is done in 2013. So it is kind of out-of-date.

Facial expression contributes 55% and vocal expression 38% to the perception of emotion, making them the two most important channels.

Combined audio-visual emotion recognition dates back to the early 2000s.


As of 2013, most of the datasets were constructed between 1996 and 2005, so they were already old. Overview of the datasets:

Currently requesting access to IEMOCAP
Emotion categorization:

Six prototypical emotions: anger, disgust, fear, happiness, sadness, and surprise.

3 categories:

  • discrete categorical representation
    categorical classification is limited, so the field moved on to dimensional representations
  • continuous dimensional representation
    affective dimensions such as Activation (passive/active), Expectation, Power/Dominance (sense of control), and Valence (negative/positive)
  • event representation
    events such as laugh, smile, sigh, hesitation, consent, etc.
Audio feature

Prosodic features are the most significant: pitch- and energy-related features.

Voice quality features:

  • Harmonics-to-Noise Ratio (HNR)
  • jitter
  • shimmer
  • spectral features
  • cepstral features: Mel-Frequency Cepstral Coefficients (MFCC) [Note: Jungseock mentioned this feature in the meeting][TODO: understand what it is]
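Since MFCCs are on my TODO list anyway: the pipeline is windowing → power spectrum → triangular mel filterbank → log → DCT. A minimal numpy-only sketch for a single frame (frame length, filter count, and coefficient count are typical values, not from the survey):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_mels=26, n_ceps=13):
    """MFCCs for one analysis frame: Hamming window -> power spectrum
    -> triangular mel filterbank -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2 / n_fft

    # Filter edges equally spaced on the mel scale, mapped back to FFT bins
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, c):
            fbank[i - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[i - 1, k] = (hi - k) / max(hi - c, 1)

    logmel = np.log(fbank @ power + 1e-10)

    # DCT-II decorrelates the log filterbank energies -> cepstral coefficients
    n = np.arange(n_mels)
    return np.array([np.sum(logmel * np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels)))
                     for k in range(n_ceps)])
```

Libraries such as librosa implement exactly this chain in one call; the point here is just to see the steps.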

Two categories of features for speech emotion recognition:

  • Local (frame-level) features
    Spectral Low-Level Descriptors (LLDs): MFCCs, Mel Filter Bank (MFB)
    Energy LLDs: loudness, energy
    Voice LLDs: jitter, shimmer
  • Global (utterance-level) features
    the set of functionals extracted from the LLDs:
    max, min, mean, std, duration, linear predictive coefficients (LPC)
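The LLD-to-functional step collapses a variable-length frame sequence into one fixed-size vector. A toy illustration (the function name is mine, not from the survey):

```python
import numpy as np

def utterance_functionals(lld):
    """Collapse frame-level LLDs (shape: frames x dims) into one fixed-size
    utterance-level vector by applying functionals per dimension."""
    funcs = (np.max, np.min, np.mean, np.std)
    return np.concatenate([f(lld, axis=0) for f in funcs])
```

An utterance of 100 frames with 13 MFCCs each becomes a single 4 × 13 = 52-dimensional vector regardless of its duration.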
Facial features:

Two categories:

  • Appearance
    wrinkles, bulges, furrows
  • Geometric features
    shape or location of facial components

Local Binary Patterns (LBPs)
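A minimal sketch of the basic 3×3 LBP operator (hand-rolled for illustration; real systems typically use `skimage.feature.local_binary_pattern`):

```python
import numpy as np

def lbp_image(img):
    """Basic 3x3 LBP: each interior pixel becomes an 8-bit code where bit i
    is set when the i-th neighbour is >= the centre pixel."""
    # Clockwise neighbour offsets starting at the top-left
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.uint8)
    centre = img[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = img[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
        out |= (neigh >= centre).astype(np.uint8) << bit
    return out
```

Histograms of these codes over image blocks are what is usually fed to the classifier.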

IEMOCAP contains detailed facial marker information.[Note: hopefully it will be the dataset that I will use]

Bimodal fusion

The first problem is the mismatched frame rates of the audio and visual features. Some studies use linear interpolation to raise the video frame rate to match the audio; others instead reduce the audio frame rate to match the video.
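The up-sampling direction can be sketched with `np.interp` (the frame rates here are common values, not taken from the survey):

```python
import numpy as np

def upsample_video_features(video_feats, video_fps=25.0, audio_fps=100.0):
    """Linearly interpolate video features (frames x dims) onto the denser
    audio frame timeline so both streams share one frame rate."""
    n_video, n_dims = video_feats.shape
    t_video = np.arange(n_video) / video_fps
    t_audio = np.arange(int(n_video / video_fps * audio_fps)) / audio_fps
    # np.interp clamps to the edge values outside the video time range
    return np.stack([np.interp(t_audio, t_video, video_feats[:, d])
                     for d in range(n_dims)], axis=1)
```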

4 Fusion categories:

  • feature-level fusion
    Concatenating the audio features and visual features
    high-dimensional feature set might be sparse
    no interactions between features
  • decision-level
    Specific classifiers for specific signals; the recognition results from the classifiers are fused at the end.
    Facial and vocal features are complementary, so the assumption of conditional independence among modalities at the decision level is inappropriate.
  • model-level
    correlation among multiple modalities
    Many of them use Hidden Markov Model(HMM) based models
  • hybrid approaches
    Combining previous models
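The first two fusion categories can be contrasted in a few lines (a schematic sketch; the fusion weight is an assumed hyperparameter):

```python
import numpy as np

def feature_level_fusion(audio_feat, visual_feat):
    # One classifier sees the concatenated (high-dimensional, possibly
    # sparse) feature vector
    return np.concatenate([audio_feat, visual_feat])

def decision_level_fusion(audio_probs, visual_probs, w_audio=0.5):
    # Each modality has its own classifier; their class posteriors are
    # combined at the end (implicitly assuming conditional independence)
    return w_audio * audio_probs + (1.0 - w_audio) * visual_probs
```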

Synchronization between audio and visual signals: a visual signal (e.g. a smile) often appears earlier than the corresponding audio signal (e.g. laughter):

  • Decision-level fusion can naturally resolve the problem.
  • Model-level fusion can de-synchronize the audio and visual streams and re-align them at the state level.

Some works also model emotional sub-states or emotional state transitions.


Issues to be addressed
  • How to extend the models to unconstrained conditions: emotion recognition in the wild.
  • A new database covering social signals such as laughs, smiles, depression, etc.
  • More emotion-related information: textual(speech content), body gesture
  • For normalization, a neutral example is selected manually. This processing can be automated.
  • Data fusion: considering the various model properties, temporal expression, asynchrony issue
  • Correlation between expression and the personality trait.
  • Personalized emotion recognition: small-sized adaptation database
  • Robustness: pose variation, partial facial occlusion, etc.

A Review of Audio-Visual Fusion with Machine Learning

No offense, but the quality of the review is just so-so. Still, it helped me target some papers on audio-visual emotion recognition.

Decision-level fusion might cause a loss of information, so in Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis, Poria et al. fuse some feature layers under absolute synchronization of the audio and visual signals.

Measures and metrics for automatic emotion classification via FACET

Review of automatic emotion classification evaluation metrics

The FACET software provides an evidence value as the score of emotion recognition. However, it is only frame-based, so the authors aggregate the evidence values. For video-level evaluation, they further compute two metrics:

  • Recognition sensitivity
    For each expression, the percentage of frames whose target evidence is > 0.
    The result is aggregated across the database to yield an average percentage indicating the overall proportion of correctly identified frames.
  • Recognition confidence
    The number of above-threshold target evidence values (x) relative to the above-threshold target (x) plus non-target (y) values, i.e. x / (x + y).
    The threshold is defined such that a frame with evidence above it is considered valid.
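My reading of the two video-level metrics, as a sketch over per-frame evidence values (the argument names are mine):

```python
import numpy as np

def recognition_sensitivity(target_evidence):
    # Fraction of frames whose target-emotion evidence is positive
    return float(np.mean(np.asarray(target_evidence) > 0))

def recognition_confidence(target_evidence, nontarget_evidence, thresh=0.0):
    # Above-threshold target frames relative to all above-threshold frames
    x = np.sum(np.asarray(target_evidence) > thresh)
    y = np.sum(np.asarray(nontarget_evidence) > thresh)
    return float(x / (x + y)) if (x + y) else 0.0
```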

The metric is dedicated to the FACET software, so I didn't read the experiment section.


Multi-cue fusion for emotion recognition in the wild

  • Fusing facial texture, facial landmark action, and audio signal
  • Deploy BRNN when modeling facial texture changes to capture dynamic change information
  • A new modality: facial landmark action, to describe the motion of facial muscles
  • audio-CNN structure to deal with audio signals.
  • Preprocessing
    • Face detection: SSD
    • Facial landmark annotation
  • CNN-BRNN for facial texture
    • CNN for discriminative facial features:
      • pre-trained VGG-face, transfer learning
        more discriminative than hand-crafted features like SIFT or HoG
      • Cropped images from video clips as training samples
      • The output of fc7: 4096-vector
    • RNN for a temporal relationship: Bi-direction
      • The process of an expression can be read in both directions
        beginning-peak and peak-vanishing
  • SVM and CNN for facial landmark action
    • A stimulation of Action Units(AUs) from the facial action coding system(FACS)
    • For each frame, they extract 51 landmarks and construct a 102-vector. The vectors for all frames are then stacked to form a feature for the video and fed to an SVM
    • Note:
      It means that the number of 102-vector should be equal for each video. In the paper they take the average length, but how to do interpolation?
    • Answer:
      They simply repeat the first frame.
    • The same structure is passed to CNN to capture the expression
    • 51 landmarks for SVM and 68 for CNN
  • CNN for audio signal
    • Instead of using the Fourier transform, they extract low-level acoustic features with the openSMILE toolbox, then stack the features to form an image-like matrix.
  • Multi-cue fusion
    • feature level
      • concatenate output from 3 NN models and train an SVM to classify.
    • decision level
      • Weighted sum of the output from each model
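The length-equalization trick for the landmark sequences (repeat the first frame up to the average length) can be sketched as follows; whether the repeats are prepended or appended is not stated, so prepending here is my assumption:

```python
import numpy as np

def pad_landmark_sequence(frames, target_len):
    """Bring a per-frame landmark sequence (frames x dims, e.g. 102-vectors)
    to a fixed length: truncate long clips, and pad short ones by repeating
    the first frame, as the paper describes."""
    frames = np.asarray(frames)
    if len(frames) >= target_len:
        return frames[:target_len]
    pad = np.repeat(frames[:1], target_len - len(frames), axis=0)
    return np.concatenate([pad, frames])
```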


  • I have a question: in CNN-BRNN, they extract the temporal changes of texture, i.e. muscle movement. Then they extract the landmark changes, which is again muscle motion. Wouldn't that be redundant?
  • They are not really SOTA but have a comparative result.

Emotion Recognition From Audio-Visual Data Using Rule Based Decision Level Fusion

Three levels of data fusion:

  • Feature
    causes data sparseness
  • Classifier
    the system becomes more complex [Note: why?]
  • Decision
    so they fuse at the decision level.

The idea is quite simple: they train an image-only model using Local Binary Patterns (LBP) and an audio-only model using Mel-Frequency Cepstral Coefficients (MFCC), and compare them with an audio-visual fusion model.

Preprocessing of speech:

  • Pre-emphasis, framing, windowing. [Discrete-time speech signal processing: principles and practice]
  • MFCC is based on the characteristics of the human auditory system.
    Mel scale: perceived pitch grows roughly logarithmically with frequency
  • Changing the upper limit of frequency and fixed lower limit as 0
  • Male and female are considered separately
  • HMM model
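The pre-emphasis / framing / windowing front-end in numpy (25 ms frames with a 10 ms hop at 16 kHz are typical values, not taken from the paper):

```python
import numpy as np

def preprocess_speech(signal, frame_len=400, hop=160, alpha=0.97):
    """Standard speech front-end: pre-emphasis to boost high frequencies,
    then overlapping Hamming-windowed frames (25 ms / 10 ms at 16 kHz)."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emph = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    n_frames = 1 + (len(emph) - frame_len) // hop
    # Gather overlapping frame indices, then apply the window per frame
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return emph[idx] * np.hamming(frame_len)
```

Each row of the result is one windowed frame, ready for FFT/MFCC extraction.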

Preprocessing of image

  • LBP represents the textures in an image: the local binary pattern operator transforms an image into an array of integer labels describing its small-scale appearance.
  • SVM for classification


  • Silence regions are discarded.
  • Rule:
    when there is confusion (several emotions having similar scores), they accept the result from the modality with less confusion.
  • The image system has higher priority
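My interpretation of the rule as code (the confusion margin is an assumed hyperparameter; the paper's exact rule set is richer than this):

```python
import numpy as np

def is_confused(scores, margin=0.1):
    # A modality is "confused" when its top two scores are within `margin`
    top2 = np.sort(scores)[-2:]
    return (top2[1] - top2[0]) < margin

def rule_based_fusion(image_scores, audio_scores, margin=0.1):
    """Prefer the image system (higher priority) unless it is confused
    and the audio system is not."""
    if is_confused(image_scores, margin) and not is_confused(audio_scores, margin):
        return int(np.argmax(audio_scores))
    return int(np.argmax(image_scores))
```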

Analysis of emotion recognition using facial expressions, speech and multimodal information

Note: this paper is from 2004, so there is not much deep learning; most models are statistical.

An analysis aiming at comparing the performance of unimodal and multimodal systems.

This experiment confirms that using multimodal systems is beneficial for emotion recognition.

  • Pairs of emotions that were confused in one modality are easily classified in the other
  • the best approach (decision-level or feature-level) to fuse the modalities depends on the application
  • Showed that audio and visual data carry complementary information

Exploring cross-modality affective reactions for audiovisual emotion recognition

Objective: the interaction between the acoustic features of the speaker and the facial expressions of the interlocutor.

The entrainment between interlocutors is very interesting!
The expressive behaviors of one subject should be correlated with the behaviors of his/her conversation partner.

IEMOCAP database

One challenge in the interaction study:
The emotional state is only annotated for the active speaker, so while a subject is listening, no label is assigned to their emotional state.
To tackle this problem, the authors assume that emotion is consistent over time. That is, while the subject is listening, they are supposed to remain in the same emotional state as before and after. Thus, the label is assigned by majority vote among the previous and following emotions.
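The majority-vote idea in a few lines (a sketch; the exact voting window around the listening segment is the paper's choice, not reproduced here):

```python
from collections import Counter

def interpolate_emotion(prev_labels, next_labels):
    """Label an unlabelled listening segment by majority vote over the
    labels of the surrounding speaking turns, assuming the subject's
    emotion is consistent over time."""
    votes = Counter(prev_labels) + Counter(next_labels)
    return votes.most_common(1)[0][0]
```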

This technique is called emotion interpolation.

This interpolation has been validated.

They only consider the case where A is listening (highlighted in grey in the paper's figure).

Facial and acoustic features:

Facial features are from markers.

Example LLDs are listed in the paper.

The openSMILE toolkit was used to extract 4368-dimensional acoustic features.

They used correlation-based feature selection (CFS) to select a set of features with high correlation to the emotion labels but low correlation between themselves.

Entrainment analysis

Mutual information is used as the metric: it quantifies dependence rather than similarity.
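A histogram-based estimate of mutual information between two continuous streams (the bin count is an arbitrary choice of mine; `sklearn.metrics.mutual_info_score` does the discrete version):

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in MI estimate from a joint histogram. Unlike correlation,
    MI also captures nonlinear dependence."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0  # skip empty cells so log stays finite
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```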

Four scenarios considered in the paper
  • (a)
    • 72% of time, they share the same emotion
  • (b):
    • facial gestures of the listeners are complementary to the speaker's emotion
    • acoustic features of the speakers are complementary to the listeners' emotion
  • (c):
    • Speaker’s voice and the listener’s facial expression are correlated
  • (d):
    • Are the cross-subject behaviors complementary to, or redundant with, the behaviors displayed by the subject themselves?
    • They are complementary!
Experiments(To be read)

Experiments show that using complementary information from the interaction helps improve emotion recognition accuracy.

Towards Efficient Multi-Modal Emotion Recognition

audio-visual based.

Emotion Recognition From Posed and Spontaneous Dynamic Expressions: Human Observers Versus Machine Analysis

  • Posed emotions yield better recognition performance, but their exaggerated nature makes the trained model suffer a performance drop in real-world emotion recognition.
  • Spontaneous expressions often contain complex action patterns.
  • This work tests spontaneous affective emotion
  • Human
  • FACET software

Specifically, they are interested in prototypicality of the emotion.
[Note: I think prototypicality measures whether an emotion looks posed. The more it looks posed, the higher the prototypicality]

  • Overall, the machine is slightly better
  • Posed:
    • Machine significantly better
  • spontaneous:
    • Less difference between machine and human
  • The higher the prototypicality, the better the machine's accuracy


  • They used numerous databases to ensure a great variety of emotion types.
  • It is believed that offering additional options such as "no emotion"/"other emotion" gives a similar result to the traditional approach: forced choice among only the 6 allowed emotions
  • The whole paper basically describes the experiment they have done. The Result section is full of data.
  • Computer-based systems perform as well as and often better than human judges.


Convolutional MKL Based Multimodal Emotion Recognition and Sentiment Analysis

End-to-End Speech Emotion Recognition Using Deep Neural Networks


Emotion and Nonverbal Communication in the 2020 Democratic Presidential Debates

Investigate the contribution of facial expression and the prosodic features to the changes in favorability rating.

  • Facial expression:
    • 6 basic emotions:
      happiness, sadness, surprise, anger, disgust, fear + neutral
  • Audio features:
    • Pitch
    • Loudness

Metric: net favorability change

Facial expression recognition:
  • Face recognition
    Using dlib’s tool
  • Face verification
    dlib’s tool.
    Allows distinguishing between candidates and other persons.
    By removing the far-away frame shots, the overall accuracy is 0.99
    The errors come from failing to identify a candidate rather than misidentifying faces
  • Facial expression recognition
    Pretrained Resnet34 model.
    Data augmentation by re-cropping and perturbation
    Cross-entropy as the loss function, Adam optimizer, no frozen layers
    The recognition accuracy is 0.714.
    Top-2 accuracy is 0.9, since some expressions are hard to tell apart
  • Filter out irrelevant frames:
  • Discussion
    Diverse training images to ensure a generalizable model
    The performance of the model is close to that of a human (0.60 to 0.70), which makes sense since it simulates human observers' FER.
  • Note
    They use a frame-based (image-level) method for expression recognition; maybe a sequence-based method could be implemented.
Audio extraction

Pitch and loudness: average + variance


  • Normalize loudness by debate
    because recordings 1 and 2 have higher overall loudness
  • Normalize pitch by gender
    because women have a higher overall pitch than men
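Both normalizations are per-group z-scoring, which can be sketched generically (the function name is mine):

```python
import numpy as np

def normalize_by_group(values, groups):
    """Z-score each value within its group (e.g. loudness per debate,
    pitch per speaker gender) so groups become comparable."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    out = np.empty_like(values)
    for g in set(groups.tolist()):
        mask = groups == g
        mu, sigma = values[mask].mean(), values[mask].std()
        out[mask] = (values[mask] - mu) / (sigma if sigma else 1.0)
    return out
```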

Features from video:

  • An average softmax score for each expression class of each candidate in each debate
  • A feature to measure uncertainty and predictability of the candidate’s expression
  • average screen time proportion
  • Use raw softmax scores instead of the prediction

Significance: Variance Inflation Factor (VIF)
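VIF for each predictor is 1 / (1 - R²), where R² comes from regressing that predictor on the remaining ones; a least-squares sketch (the usual rules of thumb flag values above roughly 5 to 10 as collinear):

```python
import numpy as np

def vif(X):
    """Variance inflation factor per column of the design matrix X
    (n samples x p features): regress each feature on the others and
    report 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # Intercept plus all columns except j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2) if r2 < 1.0 else np.inf)
    return np.array(out)
```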


  • Evaluating significance
  • Pairwise Correlation
  • density distribution of expression


  • There are some false sad facial expressions. I think that is because they used an image-based method only. With a sequence-based method, this problem might be resolved.
  • Possible topic: how different classes of Democratic candidates conduct themselves in the public eye


AVEC: Audio/Visual Emotion Challenge

FERA: Facial Expression Recognition and Analysis

EmotiW: Emotion Recognition in the Wild Challenge


