[GSoC2020] Audio-visual on IEMOCAP

Multimodal Sentiment Analysis: Addressing Key Issues and Setting Up the Baselines

Setting up the baseline in 2018 on IEMOCAP

3D CNN and openSMILE for visual and acoustic feature extraction; the feature vectors are then concatenated.

  • Baseline method:
    • BiLSTM to capture the context
    • SVM with RBF kernel, thus no context

IEMOCAP: leave-one-speaker-out evaluation (exclude one speaker at a time)
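The leave-one-speaker-out protocol can be sketched as follows (speaker IDs and the data layout are illustrative, not the papers' actual code):

```python
# Leave-one-speaker-out (LOSO) evaluation sketch for IEMOCAP.
# `samples` is a list of (speaker_id, features, label) tuples; the IDs below
# are hypothetical placeholders in the IEMOCAP naming style.

def loso_splits(samples):
    """Yield (held_out_speaker, train, test) partitions, one per speaker."""
    speakers = sorted({spk for spk, _, _ in samples})
    for held_out in speakers:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

# Toy usage: 4 utterances from 2 speakers.
data = [("Ses01F", [0.1], "hap"), ("Ses01M", [0.2], "sad"),
        ("Ses01F", [0.3], "ang"), ("Ses01M", [0.4], "neu")]
folds = list(loso_splits(data))   # one fold per speaker
```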

Context-Dependent Sentiment Analysis in User-Generated Videos

They emphasized the order of utterances: treat surrounding utterances as the context

  • Audio features:
    • openSMILE
    • frame rate 30Hz and a sliding window of 100 ms
    • Voice normalization and voice intensity threshold
    • 6373 features
  • Visual features
    • 3D CNN
    • 32 feature map with size 5x5x5
    • max-pooling of 3x3x3

– when classifying one utterance, other utterances can provide important context
– LSTM: context-dependent feature extraction

  • Different architecture
    • sc-LSTM
      • unidirectional LSTM
    • h-LSTM
      • Omit the dense layer after LSTM
    • bc-LSTM
      • bi-directional LSTM
    • uni-SVM
      • concatenate unimodal features and do the classification
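The context-dependent variants consume each video as a sequence of utterance-level feature vectors, which requires padding the sequences to a common length before feeding an LSTM. A minimal numpy sketch of that padding step (function name and shapes are my assumptions, not the paper's code):

```python
import numpy as np

# Sketch: pad per-video utterance sequences to one common length so a
# (bi-)LSTM can consume them as a single batch; the mask marks real
# utterances versus padding.
def pad_sequences(videos, feat_dim):
    """videos: list of arrays, each of shape (n_utterances, feat_dim)."""
    max_len = max(v.shape[0] for v in videos)
    batch = np.zeros((len(videos), max_len, feat_dim))
    mask = np.zeros((len(videos), max_len), dtype=bool)
    for i, v in enumerate(videos):
        batch[i, : v.shape[0]] = v
        mask[i, : v.shape[0]] = True
    return batch, mask

# Toy usage: two videos with 3 and 5 utterances of 4-dim features.
videos = [np.ones((3, 4)), np.ones((5, 4))]
batch, mask = pad_sequences(videos, feat_dim=4)   # batch: (2, 5, 4)
```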

Context-Sensitive Learning for Enhanced Audiovisual Emotion Classification

Their idea is that emotion is continuous and may last more than one minute, so the preceding emotion can be used as context.

Definition of context: information about the emotional content of audio-visual displays that happen before or after the observation

Three structures: HMM + HMM, feature + BLSTM, HMM + LSTM

They use the dimensional representation: valence and activation

Frame level features

Facial features: 46 markers
PFA (Principal Feature Analysis) to eliminate redundant feature points, plus first derivatives
normalization of landmarks

Audio features:
12 MFCC, 27 MFB, pitch, and energy
first derivative

Utterance level statistics
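The paper does not list its exact statistics here; a minimal sketch of how utterance-level statistics collapse variable-length frame-level features into a fixed-size vector, using common functionals (mean, std, min, max) as an assumption:

```python
import numpy as np

# Sketch: collapse frame-level features of shape (n_frames, n_dims) into a
# fixed-size utterance-level vector via simple statistics. The choice of
# statistics here is illustrative, not the paper's exact functional set.
def utterance_stats(frames):
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0),
                           frames.min(axis=0), frames.max(axis=0)])

# Toy usage: 2 frames of 2-dim features -> 4 statistics x 2 dims = 8 values.
frames = np.array([[1.0, 2.0], [3.0, 4.0]])
vec = utterance_stats(frames)
```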

They only use IEMOCAP

Audio-Visual Emotion Recognition using Gaussian Mixture Models for Face and Voice


Face model:
Separate the face into 6 regions; a GMM is trained for each emotion, for both the facial and vocal modalities.
Each feature vector is the 3D marker coordinates plus their first and second derivatives.
A GMM with 64 mixtures has the best performance.

Voice model:
feature vector: 12 MFCC and their first and second derivatives
voice window of length 50ms and 25ms overlap
Train a background-noise model as well, thus 5 models in total

– Bayesian

There are 7 classifiers (6 facial + 1 voice); each has 4 GMMs (5 for voice). The label is assigned by majority vote.
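The majority vote over the 7 per-region/per-modality classifiers can be sketched like this (labels are illustrative):

```python
from collections import Counter

# Sketch of majority-vote fusion over the 7 classifiers
# (6 facial-region classifiers + 1 voice classifier).
def majority_vote(predictions):
    """predictions: list of per-classifier labels; ties go to the first seen."""
    return Counter(predictions).most_common(1)[0][0]

# Toy usage: "happy" wins with 4 of 7 votes.
votes = ["happy", "happy", "sad", "happy", "neutral", "happy", "sad"]
label = majority_vote(votes)
```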

SVM: RBF kernel
Use the classification percentage (probability) from each classifier as features.


audio + mocap + text

Only consider emotions with at least 2 consistent annotations

34 MFCC features
0.2 s window and 0.1 s stride
max 100 frames, no delta or delta-delta, thus 100 × 34

Mocap data: split into 200 arrays, then average over each array, thus 200 × 189
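The split-and-average step can be sketched in numpy (the 189-dim mocap features here are random placeholders):

```python
import numpy as np

# Sketch: split a variable-length mocap stream of shape (n_frames, 189)
# into 200 chunks and average each one, producing a fixed-size 200 x 189
# array regardless of the original number of frames.
def chunk_average(frames, n_chunks=200):
    chunks = np.array_split(np.asarray(frames, dtype=float), n_chunks)
    return np.stack([c.mean(axis=0) for c in chunks])

# Toy usage: 1000 captured frames -> fixed (200, 189) representation.
mocap = np.random.rand(1000, 189)
fixed = chunk_average(mocap)
```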

  • Network:
    • Audio:
      • MLP of 3 hidden layers
      • stacked 2 LSTM + Dense layer
      • LSTM + Attention
      • BiLSTM + Attention
    • Mocap
      • 2D CNN
    • Fusion by concatenation of features

Emotion Recognition in Audio and Video Using Deep Neural Networks

They used audio spectrogram as audio input

They use the librosa library to extract the audio spectrogram with a sample rate of 44 kHz

– entire length
– clips of 3 seconds (they don't specify whether the clips overlap)

To remove the noise:
Bandpass filter between 1 Hz and 30 kHz

They also inspected the spectrograms, noticing that about 60% of the image would be blue (near-empty), so they cropped the spectrogram. Another option would be to restrict the frequency range.
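A numpy-only sketch of restricting a spectrogram to a frequency band of interest (the paper uses librosa; the FFT size, hop, and band edges here are illustrative assumptions):

```python
import numpy as np

# Sketch: compute a magnitude spectrogram by framing + windowed FFT, then
# keep only the frequency bins inside a band of interest, mimicking the
# "crop away the mostly-empty region" idea.
def band_limited_spectrogram(signal, sr, n_fft=512, hop=256,
                             f_lo=1.0, f_hi=4000.0):
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)).T   # (n_bins, n_frames)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    keep = (freqs >= f_lo) & (freqs <= f_hi)
    return spec[keep], freqs[keep]

# Toy usage: one second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)
spec, freqs = band_limited_spectrogram(signal, sr)
```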

20 frames per 3-second clip
Faces are cropped from the video, resulting in 60 × 100 images.

They also encountered memory problems while processing the videos.


Data augmentation

They first pre-trained the model in a semi-supervised manner with a contrastive loss.

Their accuracy does not exceed 54%.
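The standard pairwise contrastive loss used in this kind of pre-training can be sketched as follows (the margin value is illustrative):

```python
import numpy as np

# Sketch of the pairwise contrastive loss: similar pairs (y=1) are pulled
# together (penalty grows with distance), dissimilar pairs (y=0) are pushed
# apart up to a margin (penalty only while closer than the margin).
def contrastive_loss(a, b, y, margin=1.0):
    d = np.linalg.norm(np.asarray(a) - np.asarray(b))
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2

# Toy usage: two embeddings at distance 0.5.
x1, x2 = np.array([0.0, 0.0]), np.array([0.3, 0.4])
loss_similar = contrastive_loss(x1, x2, y=1)      # penalizes the distance
loss_dissimilar = contrastive_loss(x1, x2, y=0)   # penalizes the closeness
```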

3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition

It is a paper in speech domain, but the network architecture might be interesting to look at.

Only used mouth region

They used 0.3-second video clips

Audio processing:

Video processing:

Only the mouth region is considered

What is interesting is that the kernels are not square and their width decreases with depth.
The authors believe this allows more temporal features at the lower levels and more correlated features at the higher levels.

Here, too, the authors suggest using a contrastive loss.

Deep-emotion: Facial expression recognition using attentional convolutional network

SOTA image-based emotion recognition; useful to better understand how to handle facial expression recognition.

Their visualization is quite interesting, showing the region that the attention mechanism highlights.


A spatial transformer network serves as the attention mechanism.

The network does not need to be very deep: more than 50 layers does not show much improvement.

Facial emotion recognition from videos using deep convolutional neural network

Typical work on emotion recognition using video.

Video processing:
– 13 frames/s
– grey-scale
– Histogram equalization
– Face detection and cropping and resizing

This work is not good. I don't understand why other works cite it.

Video-based emotion recognition using CNN-RNN and C3D hybrid networks

Dataset: AFEW 6.0

  • Combining RNN and C3D can improve video-based emotion recognition.
  • LSTM is similar to this work
  • C3D
    • models both appearance and motion simultaneously (in other words, the spatio-temporal information)
  • Hybrid networks
    • Adding audio improves accuracy by 3%
      • features extracted with OpenSmile
    • Score-level fusion, weighted by the performance on the validation set
  • Implementation
    • Faces are extracted and aligned
    • CNN is VGG16-face fine-tuned with FER2013
    • 16 face features are randomly selected as inputs for LSTM
    • 16 frames of each video clip for C3D
    • At test time, a sliding window of 16 frames with stride 8 is used, repeating the last frame if necessary. Finally, the scores of all segments are averaged.
  • Video preprocessing
    • Training on optical flow images can improve the accuracy
  • Different architecture
    • CNN:
      • VGG, GoogLeNet, ResNet, etc
    • VGG + LSTM with different numbers of layers / hidden units
    • C3D: pre-trained on Sports-1M
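The test-time procedure described above (16-frame windows, stride 8, last-frame padding, score averaging) can be sketched like this; `score_fn` stands in for the C3D model:

```python
import numpy as np

# Sketch of windowed test-time scoring: slide a fixed window over the clip,
# pad the last window by repeating the final frame, and average all
# per-window scores into one clip-level score.
def windowed_average(frames, score_fn, win=16, stride=8):
    frames = np.asarray(frames)
    scores = []
    start = 0
    while True:
        seg = frames[start : start + win]
        if len(seg) < win:                              # repeat last frame
            pad = np.repeat(seg[-1:], win - len(seg), axis=0)
            seg = np.concatenate([seg, pad])
        scores.append(score_fn(seg))
        if start + win >= len(frames):
            break
        start += stride
    return np.mean(scores, axis=0)

# Toy usage: 40 "frames" of 1-dim data; score = mean of the window.
result = windowed_average(np.arange(40).reshape(-1, 1), lambda s: s.mean())
```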

Recurrent Neural Networks for Emotion Recognition in Video

CNN-RNN architecture

Co-training of audio and video representations from self-supervised temporal synchronization

Self-supervised model for cooperative learning of audio and video. This work pre-trains the model using supervised learning.

Multi-Modal Speech Emotion Recognition Using Speech Embeddings and Audio Features

This work is mainly about speech.

They use an acoustic embedding to encode the acoustic features.


Multi-label training to deal with the ambiguity of single-label training.

They used the emotion distribution produced by human annotators in both training and inference, that is:

if six annotators label an utterance as "happy," two annotators label it "neutral," one annotator labels it "sad," and one annotator labels it "frustration" (out of the five classes "angry," "happy," "neutral," "sad," and "frustration"), we define the ground truth of the utterance as the five-dimensional distribution {0, 0.6, 0.2, 0.1, 0.1}
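Turning the annotator counts into that soft ground-truth distribution is just a normalization:

```python
import numpy as np

# Sketch: convert per-class annotator counts into a soft ground-truth
# distribution. Class order: angry, happy, neutral, sad, frustration.
def soft_label(counts):
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

# The example above: 0 angry, 6 happy, 2 neutral, 1 sad, 1 frustration.
target = soft_label([0, 6, 2, 1, 1])   # -> [0, 0.6, 0.2, 0.1, 0.1]
```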

They used an ELM (Extreme Learning Machine) for classification.

Learning Affective Features With a Hybrid Deep Model for Audio–Visual Emotion Recognition

audio-visual segment features with CNN and 3D CNN:
audio -> CNN
video -> 3D CNN
Then fuse the features with a Deep Belief Network (DBN)

It is not validated on IEMOCAP
It uses RML, eNTERFACE05, and BAUM-1s

To model the non-linear correlation of multiple inputs with different statistical properties (i.e., the different modalities), they use a DBN.

Different fusion strategy:

Feature-level fusion cannot model the complicated relationships between the audio and visual modalities, e.g., their differences in time scales and metric levels.

Decision-level fusion cannot capture the mutual correlation among the modalities, because they are assumed to be independent. It therefore does not conform to the fact that human beings show audio and visual expressions in a complementary and redundant manner, rather than a mutually independent one.

Score-level fusion is implemented by combining the individual classification scores, which indicate the likelihood that a sample belongs to each class. By contrast, decision-level fusion is performed by combining multiple predicted class labels.
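The distinction can be made concrete with a small sketch (the per-class scores are illustrative, not from any paper):

```python
import numpy as np
from collections import Counter

# Score-level fusion: average (optionally weighted) the per-class score
# vectors from each modality, then take the argmax class.
def score_level_fusion(score_lists, weights=None):
    fused = np.average(np.asarray(score_lists, dtype=float),
                       axis=0, weights=weights)
    return int(np.argmax(fused)), fused

# Decision-level fusion: majority vote over already-predicted class labels.
def decision_level_fusion(labels):
    return Counter(labels).most_common(1)[0][0]

# Toy usage with two modalities and three classes.
audio_scores = [0.2, 0.5, 0.3]
video_scores = [0.7, 0.2, 0.1]
cls, fused = score_level_fusion([audio_scores, video_scores])
vote = decision_level_fusion([1, 0, 0])   # e.g. labels from three decisions
```

Note how the two schemes can disagree: score-level fusion picks class 0 here even though the audio classifier's own argmax was class 1.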

For model-level fusion, the implementation depends on the fusion model used; HMMs are often employed for this strategy.

Existing work uses shallow fusion methods.

Audio: AlexNet
Visual: C3D-Sports-1M

Split each utterance into a number of overlapping segments, then learn audio-visual features from each segment.

Audio input
Transform the audio signal into a 2-D spectrogram-like array as the CNN input:
64 Mel filters × 64 frames × 3 channels

Visual input
16 frames

For each segment they obtain a feature vector; average pooling then yields a fixed-size feature vector for the entire utterance.

Audio and Face Video Emotion Recognition in the Wild Using Deep Neural Networks and Small Datasets

Example of feature extraction for audio and visual

Fusion of facial expression recognition and audio emotion recognition subsystems at score level

Use a CNN to extract frame features, then compute video-level features (eigenvectors, covariance matrix, multi-dimensional Gaussian distribution).

Audio: the means of the frame-wise scores are taken as the final predictions for video clips.
Features: Geneva Minimalistic Acoustic Parameter Set (eGeMAPS)

Audio-visual fusion at score level

Dense SIFT and CNN features for face video emotion recognition

Deep Multimodal Learning for Affective Analysis and Retrieval

Shows that a Deep Boltzmann Machine can learn high-level representations of audio-visual cues from low-level hand-crafted features.

Emotion Spotting: Discovering Regions of Evidence in Audio-Visual Emotion

Explore which region is more salient for a certain expression.