Speech Emotion Recognition

Speech is a powerful medium for verbally expressing attitudes and feelings. Speech emotion recognition (SER) is the practice of using AI models to infer emotion from human speech. Detecting emotional content in a speech signal and classifying the emotions expressed in utterances has been regarded as an essential research area for over a decade, and human-computer interaction research has recently expanded to include emotion identification from speech. Fields such as psychology, medicine, education, and entertainment stand to benefit from this capability. To recognize the emotion behind an utterance, relevant characteristics must first be extracted from it.

Introduction

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format. While it is commonly confused with voice recognition, speech recognition focuses on translating speech from a verbal format to a text one, whereas voice recognition seeks only to identify an individual user's voice.
Speech recognition systems use computer algorithms to process and interpret spoken words and convert them into text. A software program turns the sound a microphone records into written language that computers and humans can understand, following these four steps:
  1. Analyze the audio
  2. Break it into parts
  3. Digitize it into a computer-readable format
  4. Use an algorithm to match it to the most suitable text representation
Speech recognition software must adapt to the highly variable and context-specific nature of human speech. The software algorithms that process and organize audio into text are trained on different speech patterns, speaking styles, languages, dialects, accents, and phrasings. The software also separates spoken audio from background noise that often accompanies the signal.
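The first three steps above can be sketched in a few lines of numpy. This is a hypothetical minimal example that digitizes a synthetic waveform and splits it into short frames; a real system would start from microphone samples rather than a generated tone:

```python
import numpy as np

# Steps 1-3 sketch: a synthetic 1-second "recording" at 16 kHz,
# digitized to 16-bit integers and split into 25 ms frames.
sr = 16000                                  # sample rate (samples/second)
t = np.linspace(0, 1, sr, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for microphone input

# Digitize: quantize the float waveform to 16-bit PCM, as a sound card would
pcm = np.int16(audio * 32767)

# Break into parts: 25 ms frames with a 10 ms hop (common in ASR front ends)
frame_len, hop = int(0.025 * sr), int(0.010 * sr)
n_frames = 1 + (len(pcm) - frame_len) // hop
frames = np.stack([pcm[i*hop : i*hop + frame_len] for i in range(n_frames)])

print(frames.shape)  # (98, 400): 98 frames of 400 samples each
```

Step 4, matching frames to text, is where the trained acoustic and language models come in and is far beyond a short sketch.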

Speech Emotion Recognition is a task of speech processing and computational paralinguistics that aims to recognize and categorize the emotions expressed in spoken language. The goal is to determine the emotional state of a speaker, such as happiness, anger, sadness, or frustration, from their speech patterns, such as prosody, pitch, and rhythm.
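Pitch is one of the prosodic cues mentioned above. As a toy illustration (on a synthetic tone, not a production pitch tracker), the fundamental frequency of a voiced sound can be estimated from the autocorrelation of the signal:

```python
import numpy as np

# Toy pitch estimate via autocorrelation on a synthetic 220 Hz tone.
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 220 * t)             # stand-in for a voiced speech frame

# Autocorrelation peaks at lags equal to the pitch period
ac = np.correlate(x, x, mode="full")[len(x) - 1:]

# Search lags corresponding to 50-400 Hz, the typical human pitch range
lo, hi = sr // 400, sr // 50
lag = lo + np.argmax(ac[lo:hi])
f0 = sr / lag
print(round(f0, 1))  # close to 220 Hz
```

An SER system combines cues like this estimated pitch contour with energy, rhythm, and spectral features before classification.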

Applications of SER

  1. Sentiment Analysis
  2. Call centres
  3. Voice Assistants

Mel Frequency Cepstral Coefficients (MFCC)

The mel frequency cepstral coefficient (MFCC) algorithm is built around a filter bank, called the mel filter bank, that models the perception characteristics of the human ear. The bank can be represented as a set of triangular bandpass filters whose frequency resolution is high in the low-frequency regions and low in the high-frequency regions, arranged so that the edges of each band coincide with the centers of the neighboring bands. MFCC extraction transforms the audio signal through a series of steps for use in speech recognition and analysis:

  1. Pre-emphasis
  2. Framing
  3. Windowing
  4. Fast Fourier Transform (FFT)
  5. Mel Filter Bank
  6. Logarithm
  7. Discrete Cosine Transform (DCT)
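The seven steps above can be sketched end to end in numpy. This is an illustrative implementation with common default choices (0.97 pre-emphasis, 25 ms frames, Hamming window, 26 filters, 13 coefficients); it is not tuned to match any particular library's output:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_filters=26, n_ceps=13,
         frame_len=0.025, hop=0.010, n_fft=512):
    # 1. Pre-emphasis: boost the high frequencies
    sig = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])

    # 2. Framing: 25 ms windows with a 10 ms hop
    flen, fhop = int(frame_len * sr), int(hop * sr)
    n_frames = 1 + (len(sig) - flen) // fhop
    frames = np.stack([sig[i*fhop : i*fhop + flen] for i in range(n_frames)])

    # 3. Windowing: a Hamming window reduces spectral leakage
    frames = frames * np.hamming(flen)

    # 4. FFT: power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 5. Mel filter bank: triangular filters whose edges sit at the
    #    centers of the neighboring bands
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_energy = power @ fbank.T

    # 6. Logarithm: compress dynamic range, as loudness perception does
    log_mel = np.log(mel_energy + 1e-10)

    # 7. DCT-II: decorrelate, keep the first n_ceps coefficients
    n = log_mel.shape[1]
    k = np.arange(n_ceps)[:, None]
    dct = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return log_mel @ dct.T

sr = 16000
tone = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)  # 1 s synthetic tone
feats = mfcc(tone, sr)
print(feats.shape)  # (98, 13): 13 coefficients per frame
```

In practice a library routine such as librosa's MFCC extractor would be used instead, but the steps it performs are the same as above.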

Recurrent Neural Networks (RNN)

A recurrent neural network (RNN) is a type of artificial neural network designed for sequential or time-series data. These deep learning models are commonly used for ordinal or temporal problems such as language translation, natural language processing (NLP), speech recognition, and image captioning, and they are incorporated into popular applications such as Siri, voice search, and Google Translate.

Like feedforward and convolutional neural networks (CNNs), recurrent neural networks learn from training data. They are distinguished by their “memory”: information from prior inputs influences the current output. While traditional deep neural networks assume that inputs and outputs are independent of each other, the output of a recurrent neural network depends on the prior elements of the sequence.
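This "memory" can be shown with a minimal recurrent cell in numpy. The hidden state is updated at each timestep, so the final state depends on the order of the inputs, unlike a feedforward network applied to pooled features (the weights here are random and purely illustrative):

```python
import numpy as np

# Minimal recurrent cell: the hidden state h carries information
# from earlier timesteps into every later step.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 5
W_xh = rng.normal(0, 0.5, (n_hid, n_in))   # input -> hidden weights
W_hh = rng.normal(0, 0.5, (n_hid, n_hid))  # hidden -> hidden ("memory") weights
b = np.zeros(n_hid)

def rnn_forward(sequence):
    h = np.zeros(n_hid)                    # state starts empty
    for x in sequence:                     # one update per timestep
        h = np.tanh(W_xh @ x + W_hh @ h + b)
    return h                               # final state summarizes the sequence

seq = rng.normal(size=(4, n_in))           # a toy 4-step sequence
h_fwd = rnn_forward(seq)
h_rev = rnn_forward(seq[::-1])
# Reversing the sequence changes the final state: the RNN is order-sensitive.
print(h_fwd.shape)  # (5,)
```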

Proposed model

I am using a recurrent neural network (RNN) with a single hidden layer. The input takes the MFCC, mel, and chroma features of each audio file, which pass through a hidden layer of __ neurons; the output layer consists of 7 neurons, one for each emotion: happy, sad, disgust, anger, surprised, neutral, and fear.
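A numpy forward-pass sketch of this architecture is below. The feature sizes (40 MFCC, 128 mel, 12 chroma) and the hidden width of 64 are illustrative assumptions, not the actual settings of the model, and the weights are random rather than trained:

```python
import numpy as np

# Hypothetical sketch of the described classifier: per-frame features
# (MFCC + mel + chroma) flow through one recurrent hidden layer into a
# 7-way softmax over happy/sad/disgust/anger/surprised/neutral/fear.
n_mfcc, n_mel, n_chroma = 40, 128, 12      # assumed feature dimensions
n_in = n_mfcc + n_mel + n_chroma           # 180 features per frame
n_hid, n_out = 64, 7                       # hidden width chosen arbitrarily

rng = np.random.default_rng(1)
W_xh = rng.normal(0, 0.1, (n_hid, n_in))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
W_hy = rng.normal(0, 0.1, (n_out, n_hid))

def predict(frames):
    h = np.zeros(n_hid)
    for x in frames:                       # recur over the audio's frames
        h = np.tanh(W_xh @ x + W_hh @ h)
    logits = W_hy @ h                      # read out from the final state
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                 # softmax over the 7 emotions

probs = predict(rng.normal(size=(98, n_in)))   # 98 frames of fake features
print(probs.shape)  # (7,): one probability per emotion
```

The trained version in the linked notebook uses a deep learning framework rather than hand-rolled numpy, but the data flow is the same.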

The full code is given below:
Colab File: https://colab.research.google.com/drive/1lW9wX1YBPo3JOwGhot_mWsfF2H_9gqx_?usp=sharing
GitHub Link:
