
maelfabien / Multimodal Emotion Recognition

Licence: apache-2.0
A real time Multimodal Emotion Recognition web app for text, sound and video inputs

Programming Languages

python
139335 projects - #7 most used programming language

Projects that are alternatives of or similar to Multimodal Emotion Recognition

Pytorch Multi Style Transfer
Neural Style and MSG-Net
Stars: ✭ 687 (+78.91%)
Mutual labels:  jupyter-notebook, real-time
Icnet Tensorflow
TensorFlow-based implementation of "ICNet for Real-Time Semantic Segmentation on High-Resolution Images".
Stars: ✭ 396 (+3.13%)
Mutual labels:  jupyter-notebook, real-time
Mobilepose Pytorch
Light-weight Single Person Pose Estimator
Stars: ✭ 427 (+11.2%)
Mutual labels:  jupyter-notebook, real-time
Keras realtime multi Person pose estimation
Keras version of Realtime Multi-Person Pose Estimation project
Stars: ✭ 728 (+89.58%)
Mutual labels:  jupyter-notebook, real-time
Xpedite
A non-sampling profiler purpose built to measure and optimize performance of ultra low latency/real time systems
Stars: ✭ 89 (-76.82%)
Mutual labels:  jupyter-notebook, real-time
Optical Flow Filter
A real time optical flow algorithm implemented on GPU
Stars: ✭ 146 (-61.98%)
Mutual labels:  jupyter-notebook, real-time
Multi Task Refinenet
Multi-Task (Joint Segmentation / Depth / Surface Normals) Real-Time Light-Weight RefineNet
Stars: ✭ 139 (-63.8%)
Mutual labels:  jupyter-notebook, real-time
Pytorch realtime multi Person pose estimation
Pytorch version of Realtime Multi-Person Pose Estimation project
Stars: ✭ 205 (-46.61%)
Mutual labels:  jupyter-notebook, real-time
Powerlaw
Stars: ✭ 379 (-1.3%)
Mutual labels:  jupyter-notebook
Dawp
Jupyter Notebooks and code for Derivatives Analytics with Python (Wiley Finance) by Yves Hilpisch.
Stars: ✭ 383 (-0.26%)
Mutual labels:  jupyter-notebook
Daily Deeplearning
🔥机器学习/深度学习/Python/算法面试/自然语言处理教程/剑指offer/machine learning/deeplearning/Python/Algorithm interview/NLP Tutorial
Stars: ✭ 381 (-0.78%)
Mutual labels:  jupyter-notebook
Yolo V5
🎨 Pytorch YOLO v5 训练自己的数据集超详细教程!!! 🎨 (提供PDF训练教程下载)
Stars: ✭ 379 (-1.3%)
Mutual labels:  jupyter-notebook
Transformers Tutorials
Github repo with tutorials to fine tune transformers for diff NLP tasks
Stars: ✭ 384 (+0%)
Mutual labels:  jupyter-notebook
Python4datascience.ch
从0开始接触Python处理数据科学问题。包含Python0基础入门、科学计算工具入门、数学与计算机基础入门、统计学习入门。
Stars: ✭ 381 (-0.78%)
Mutual labels:  jupyter-notebook
Njucs Course Material From Yikaizhang
南京大学计算机系 课程资料 作业 代码 实验报告 NJU-CS 课程分享计划 🍚​
Stars: ✭ 383 (-0.26%)
Mutual labels:  jupyter-notebook
Openpose train
Training repository for OpenPose
Stars: ✭ 381 (-0.78%)
Mutual labels:  real-time
Deep Learning From Scratch
深度学习入门-基于Python的理论与实现》,包含源代码和高清PDF(带书签);慕课网imooc《深度学习之神经网络(CNN-RNN-GAN)算法原理-实战》
Stars: ✭ 381 (-0.78%)
Mutual labels:  jupyter-notebook
Vae cf
Variational autoencoders for collaborative filtering
Stars: ✭ 386 (+0.52%)
Mutual labels:  jupyter-notebook
Touchdesigner shared
TouchDesigner toxes and small projects
Stars: ✭ 385 (+0.26%)
Mutual labels:  real-time
2020 benfords
Stars: ✭ 384 (+0%)
Mutual labels:  jupyter-notebook

Real-Time Multimodal Emotion Recognition


Don't hesitate to ⭐ the repo if you enjoy our work!

In a nutshell

We developed a multimodal emotion recognition platform to analyze the emotions of job candidates, in partnership with the French Employment Agency.

We analyze facial, vocal and textual emotions, using mostly deep learning based approaches. We deployed a web app using Flask:

image

The tool can be accessed from the WebApp repository by installing the requirements and launching main.py.

We have also written a paper on our work: https://www.overleaf.com/read/xvtrrfpvzwhf

Table of Contents

In this project, we explore state-of-the-art models in multimodal sentiment analysis. We have chosen to work on text, sound and video inputs and to develop an ensemble model that gathers the information from all these sources and displays it in a clear and interpretable way.

0. Technologies

image

I. Context

Affective computing is a field of Machine Learning and Computer Science that studies the recognition and processing of human affects. Multimodal Emotion Recognition is a relatively new discipline that aims to include text inputs, as well as sound and video. This field has been rising with the development of social networks, which has given researchers access to vast amounts of data.

II. Data Sources

We have chosen to diversify the data sources we used depending on the type of data considered. All data sets used are free of charge and can be directly downloaded.

  • For the text input, we use the Stream-of-consciousness dataset gathered in a study by Pennebaker and King [1999]. It consists of a total of 2,468 daily writing submissions from 34 psychology students (29 women and 5 men whose ages ranged from 18 to 67, with a mean of 26.4). The writing submissions took the form of an unrated course assignment. For each assignment, students were expected to write for a minimum of 20 minutes per day about a specific topic. The data was collected during a 2-week summer course between 1993 and 1996. Each student completed their daily writing for 10 consecutive days. Students' personality scores were assessed with the Big Five Inventory (BFI) [John et al., 1991]. The BFI is a 44-item self-report questionnaire that provides a score for each of the five personality traits. Each item consists of short phrases and is rated on a 5-point scale ranging from 1 (disagree strongly) to 5 (agree strongly). An instance in the data source consists of an ID, the actual essay, and five classification labels for the Big Five personality traits. Labels were originally either yes ('y') or no ('n'), indicating a high or low score for a given trait.
  • For the audio data, we use the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). This database contains 7,356 files (total size: 24.8 GB) recorded by 24 professional actors (12 female, 12 male), vocalizing two lexically-matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions, and song contains calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. All conditions are available in three modality formats: Audio-only (16-bit, 48 kHz .wav), Audio-Video (720p H.264, AAC 48 kHz, .mp4), and Video-only (no sound). https://zenodo.org/record/1188976#.XCx-tc9KhQI
  • For the video data, we use the popular FER2013 Kaggle Challenge data set. The data consists of 48x48 pixel grayscale images of faces. The faces have been automatically registered so that each face is more or less centered and occupies about the same amount of space in every image. The data set remains quite challenging to use, since it contains empty pictures and wrongly classified images. https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data

III. Download

Modality | Data | Processed Data (for training)    | Pre-trained Model | Colab Notebook | Other
Text     | here | X-train, y-train, X-test, y-test | Weights, Model    | ---            | ---
Audio    | here | X-train, y-train, X-test, y-test | Weights, Model    | Colab Notebook | ---
Video    | here | X-train, y-train, X-test, y-test | Weights, Model    | Colab Notebook | Face Detect Model

IV. Methodology

Our aim is to develop a model able to provide live sentiment analysis with a visual user interface. Therefore, we have decided to separate two types of inputs:

  • Textual input, such as answers to questions asked to the candidate through the platform
  • Video input from a live webcam, or stored in an MP4 or WAV file, from which we split the audio and the images

a. Text Analysis

image

Pipeline

The text-based personality recognition pipeline has the following structure (a minimal preprocessing sketch follows the list):

  • Text data retrieval
  • Custom natural language preprocessing:
    • Tokenization of the document
    • Cleaning and standardization of formulations using regular expressions
    • Deletion of the punctuation
    • Lowercasing the tokens
    • Removal of predefined stopwords
    • Application of part-of-speech tags on the remaining tokens
    • Lemmatization of tokens using part-of-speech tags for more accuracy.
    • Padding the sequences of tokens of each document to constrain the shape of the input vectors.
  • 300-dimension Word2Vec trainable embedding
  • Prediction using our pre-trained model
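
The sketch below shows, under our assumptions, how such a preprocessing chain could be wired together with NLTK and Keras; the function names and the padding length are illustrative choices, not the repository's exact code.

```python
# Hedged sketch of the text preprocessing pipeline described above.
# Assumes the NLTK resources (punkt, stopwords, averaged_perceptron_tagger,
# wordnet) have been downloaded; names and the padding length are illustrative.
import re
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def to_wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to WordNet POS so lemmatization is more accurate
    return {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}.get(
        treebank_tag[0], wordnet.NOUN)

def preprocess(documents, max_len=300):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    cleaned = []
    for doc in documents:
        doc = re.sub(r"[^a-zA-Z\s]", " ", doc)           # drop punctuation and digits
        tokens = nltk.word_tokenize(doc.lower())          # tokenize + lowercase
        tokens = [t for t in tokens if t not in stop_words]
        tagged = nltk.pos_tag(tokens)                      # part-of-speech tagging
        lemmas = [lemmatizer.lemmatize(w, to_wordnet_pos(t)) for w, t in tagged]
        cleaned.append(" ".join(lemmas))
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(cleaned)
    sequences = tokenizer.texts_to_sequences(cleaned)
    return pad_sequences(sequences, maxlen=max_len), tokenizer
```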

Model

We have chosen a neural network architecture that combines one-dimensional convolutional layers and recurrent layers. The one-dimensional convolution layer plays a role comparable to feature extraction: it finds patterns in the text data. The Long Short-Term Memory cells are then used to leverage the sequential nature of natural language: unlike regular neural networks, where inputs are assumed to be independent of each other, these architectures progressively accumulate and capture information through the sequence. LSTMs have the property of selectively remembering patterns for long durations of time. Our final model first includes 3 consecutive blocks consisting of the following four layers: one-dimensional convolution layer, max pooling, spatial dropout, and batch normalization. The numbers of convolution filters are respectively 128, 256 and 512 for each block, the kernel size is 8, the max pooling size is 2 and the dropout rate is 0.3. Following the three blocks, we stack 3 LSTM cells with 180 outputs each. Finally, a fully connected layer of 128 nodes is added before the last classification layer.
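
As a rough illustration (not the repository's exact code), this architecture could be assembled in Keras as follows; the vocabulary size, the 5-trait sigmoid output and the compilation settings are assumptions.

```python
# Hedged sketch of the 1D-CNN + LSTM architecture described above.
# vocab_size, the output layer and the compile settings are assumptions.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     SpatialDropout1D, BatchNormalization,
                                     LSTM, Dense)

def build_text_model(vocab_size, max_len=300):
    model = Sequential()
    model.add(Embedding(vocab_size, 300, input_length=max_len))  # 300-dim trainable embedding
    for filters in (128, 256, 512):                              # three convolutional blocks
        model.add(Conv1D(filters, kernel_size=8, padding="same", activation="relu"))
        model.add(MaxPooling1D(pool_size=2))
        model.add(SpatialDropout1D(0.3))
        model.add(BatchNormalization())
    model.add(LSTM(180, return_sequences=True))
    model.add(LSTM(180, return_sequences=True))
    model.add(LSTM(180))                                         # last cell returns a single vector
    model.add(Dense(128, activation="relu"))
    model.add(Dense(5, activation="sigmoid"))                    # one score per Big Five trait
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```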

image

b. Audio Analysis

image

Pipeline

The speech emotion recognition pipeline was built the following way (a minimal feature-extraction sketch follows the list):

  • Voice recording
  • Audio signal discretization
  • Log-mel-spectrogram extraction
  • Splitting of the spectrogram using a rolling window
  • Prediction using our pre-trained model
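
A minimal sketch of this front end, assuming librosa for the signal processing; the sampling rate, number of mel bands and window/step sizes are illustrative values, not the project's exact settings.

```python
# Hedged sketch of the audio front end: discretization, log-mel-spectrogram
# extraction, and a rolling-window split (all numeric settings are assumptions).
import numpy as np
import librosa

def log_mel_windows(path, sr=16000, n_mels=128, win_size=128, step=64):
    signal, sr = librosa.load(path, sr=sr)                     # discretized audio signal
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)             # log-mel-spectrogram
    # Slide a fixed-size window with a fixed time-step along the time axis
    windows = [log_mel[:, start:start + win_size]
               for start in range(0, log_mel.shape[1] - win_size + 1, step)]
    return np.stack(windows)                                   # assumes the clip is long enough
```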

Model

The model we have chosen is a Time Distributed Convolutional Neural Network.

The main idea of a Time Distributed Convolutional Neural Network is to apply a rolling window (of fixed size and time-step) along the log-mel-spectrogram. Each window is the input of a convolutional neural network composed of four Local Feature Learning Blocks (LFLBs), and the outputs of these convolutional networks are fed into a recurrent neural network composed of 2 LSTM (Long Short-Term Memory) cells to learn the long-term contextual dependencies. Finally, a fully connected layer with softmax activation is used to predict the emotion detected in the voice.
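
A hedged Keras sketch of such a network is shown below; the filter counts, LSTM widths and the seven-emotion softmax are assumptions made for illustration.

```python
# Hedged sketch of the Time Distributed CNN + LSTM described above.
# Filter counts, LSTM widths and the 7-emotion output are illustrative assumptions.
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (TimeDistributed, Conv2D, BatchNormalization,
                                     Activation, MaxPooling2D, Dropout, Flatten,
                                     LSTM, Dense)

def build_audio_model(n_windows, n_mels, win_size, n_emotions=7):
    inputs = Input(shape=(n_windows, n_mels, win_size, 1))
    x = inputs
    # Four Local Feature Learning Blocks, applied to every spectrogram window
    for filters in (64, 64, 128, 128):
        x = TimeDistributed(Conv2D(filters, (3, 3), padding="same"))(x)
        x = TimeDistributed(BatchNormalization())(x)
        x = TimeDistributed(Activation("elu"))(x)
        x = TimeDistributed(MaxPooling2D(pool_size=(2, 2)))(x)
        x = TimeDistributed(Dropout(0.2))(x)
    x = TimeDistributed(Flatten())(x)
    x = LSTM(256, return_sequences=True)(x)        # learn long-term contextual dependencies
    x = LSTM(256)(x)
    outputs = Dense(n_emotions, activation="softmax")(x)
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```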

image

To limit overfitting, we tuned the model with:

  • Audio data augmentation
  • Early stopping
  • And kept the best model

c. Video Analysis

image

Pipeline

The video processing pipeline was built the following way (a minimal per-frame sketch follows the list):

  • Launch the webcam
  • Identify the face using the Histogram of Oriented Gradients method
  • Zoom on the face
  • Resize the face to 48 x 48 pixels
  • Make a prediction on the face using our pre-trained model
  • Identify the number of eye blinks from the facial landmarks on each frame
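
A minimal per-frame sketch of this pipeline, assuming dlib's HOG-based face detector and OpenCV; the model path and label order are assumptions, and the blink counter is omitted for brevity.

```python
# Hedged per-frame sketch of the video pipeline (HOG face detection, 48x48 crop,
# emotion prediction). Model path and emotion label order are assumptions.
import cv2
import dlib
import numpy as np
from tensorflow.keras.models import load_model

detector = dlib.get_frontal_face_detector()          # HOG-based face detector
model = load_model("Models/video_model.h5")          # hypothetical pre-trained weights
EMOTIONS = ["Angry", "Disgust", "Fear", "Happy", "Sad", "Surprise", "Neutral"]

cap = cv2.VideoCapture(0)                            # launch the webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for rect in detector(gray, 1):                   # identify the face(s)
        top, bottom = max(rect.top(), 0), rect.bottom()
        left, right = max(rect.left(), 0), rect.right()
        face = cv2.resize(gray[top:bottom, left:right], (48, 48)) / 255.0
        probs = model.predict(face.reshape(1, 48, 48, 1), verbose=0)[0]
        print(EMOTIONS[int(np.argmax(probs))])       # predicted emotion for this face
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```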

Model

The model we have chosen is an XCeption model, since it outperformed the other approaches we had developed so far. We tuned the model with:

  • Data augmentation
  • Early stopping
  • Decreasing learning rate on plateau
  • L2-Regularization
  • Class weight balancing
  • And kept the best model

As you might have understood, the aim was to limit overfitting as much as possible in order to obtain a robust model.

image

The XCeption architecture is based on DepthWise Separable convolutions, which require far fewer trainable parameters and therefore reduce training time on Colab's GPUs to less than 90 minutes.
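
For illustration, a single depthwise separable convolution block in Keras looks like the sketch below; the filter count and activation are arbitrary, and the full XCeption network stacks many such blocks with residual connections.

```python
# Hedged illustration of a depthwise separable convolution block, the building
# block of the XCeption architecture (filter count and activation are assumptions).
from tensorflow.keras.layers import SeparableConv2D, BatchNormalization, Activation

def separable_block(x, filters=128):
    # A depthwise spatial convolution followed by a 1x1 pointwise convolution,
    # which needs far fewer parameters than a standard Conv2D of the same width.
    x = SeparableConv2D(filters, (3, 3), padding="same", use_bias=False)(x)
    x = BatchNormalization()(x)
    return Activation("relu")(x)
```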

image

When it comes to applying CNNs in real-life applications, being able to explain the results is a great challenge. We can plot class activation maps, which display the pixels that were activated by the last convolution layer. We notice how the pixels are activated differently depending on the emotion being labeled: happiness seems to depend on the pixels linked to the eyes and mouth, whereas sadness or anger seem to be more related to the eyebrows.
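
One way to compute such maps (a Grad-CAM-style sketch, not necessarily the exact method used here) is to weight the last convolution layer's feature maps by the gradient of the predicted class score; the choice of layer name is an assumption about the loaded model.

```python
# Hedged Grad-CAM-style sketch for visualizing which pixels drive a prediction.
# The choice of convolution layer and the normalization scheme are assumptions.
import tensorflow as tf

def class_activation_map(model, image, layer_name, class_idx):
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        score = preds[:, class_idx]                 # score of the emotion of interest
    grads = tape.gradient(score, conv_out)          # gradient of the score w.r.t. feature maps
    weights = tf.reduce_mean(grads, axis=(1, 2))    # average gradient per channel
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)
    return cam.numpy()                              # heat map over the conv layer's grid
```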

image

d. Ensemble Model

The ensemble model has not been implemented in this version.

image

V. How to use it?

There are several resources available:

  • the working notebooks can be found in the Text/Video/Audio sections
  • the final notebooks can be accessed through the Google Colab link in the table at the beginning

To use the web app:

  • Clone the project locally
  • Go in the WebApp folder
  • Run `pip install -r requirements.txt`
  • Launch python app.py

VI. Research Paper

If you are interested in the research paper we are currently working on, feel free to check out this link: https://www.overleaf.com/read/xvtrrfpvzwhf

VII. Contributors

Anatoli-deBRADKE 💻
maelfabien 💻
RaphaelLederman 💻
STF-R 💻