How Do Seq2Seq Models Perform Speech Recognition?

A Seq2Seq model consists of two main components, an encoder and a decoder, connected through a cross-attention mechanism. Here we will discuss how the encoder and decoder, together with the attention mechanism, can achieve ASR:

  1. Encoder Processing and Feature Extraction: The encoder takes raw audio data as its input. Because this data is not initially in a machine-understandable format, the encoder transforms it using common techniques like spectrogram representations, Mel-frequency cepstral coefficients (MFCCs), or other time-frequency transformations (see the feature-extraction sketch after this list). It then extracts crucial features, such as pitch, intensity, and spectral content, collectively known as the acoustic characteristics of sound. These features play a vital role in comprehending spoken language.
  2. Sequence-to-Sequence Architecture: The sequence-to-sequence architecture captures the sequential order of words in a sentence. This step ensures that the model understands the significance of word order in language (a minimal architecture sketch also follows this list).
  3. Attention Mechanism: The encoded features are then fed into an attention mechanism within the model. For each output step, this mechanism focuses on the most relevant parts of the audio, aiding in the identification of the words being spoken.
  4. Data Flow to the Decoder: The processed data is passed to the decoder, which is responsible for translating the encoded representation into human-readable text.
  5. Language Model Integration: During the decoding process, a language model is utilized. Trained on a vast corpus of text, this model aids in determining the correct placement of words in the transcribed text, ensuring the coherence and accuracy of the final output by calculating word probabilities.
  6. Translation to Human-Readable Text: The decoder, armed with the language model's knowledge, executes the final translation. It converts the machine-processed data, enhanced with acoustic features and guided by the attention mechanism, into a meaningful and understandable textual representation for humans.
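
As a concrete illustration of step 1, the sketch below computes a log-Mel spectrogram and MFCCs with librosa. The file name, the 16 kHz sample rate, and the choice of 80 Mel bands are assumptions made for illustration, not requirements of any particular model.

```python
import librosa

# Load the waveform, resampling to 16 kHz (a common rate for ASR).
# "speech.wav" is a placeholder file name.
waveform, sample_rate = librosa.load("speech.wav", sr=16000)

# Log-Mel spectrogram: a time-frequency representation the encoder can consume.
mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=80)
log_mel = librosa.power_to_db(mel)  # shape: (80, num_frames)

# MFCCs: a compact summary of the spectral envelope.
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=13)

print(log_mel.shape, mfcc.shape)
```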

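To make steps 2 through 4 concrete, here is a minimal encoder-decoder sketch in PyTorch. It illustrates the data flow only: the layer sizes, vocabulary size, recurrent layers, and the use of nn.MultiheadAttention for cross-attention are all assumptions chosen to keep the example small and runnable; a real ASR model would be substantially larger.

```python
import torch
import torch.nn as nn

class TinySeq2SeqASR(nn.Module):
    """Minimal encoder-decoder with cross-attention (illustrative only)."""

    def __init__(self, n_mels=80, d_model=128, vocab_size=1000):
        super().__init__()
        # Encoder: turns acoustic feature frames into hidden representations.
        self.input_proj = nn.Linear(n_mels, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        # Decoder: autoregressively emits text tokens.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        # Cross-attention: lets each decoder step focus on relevant audio frames.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, log_mel, tokens):
        # log_mel: (batch, frames, n_mels); tokens: (batch, text_len)
        enc_out, _ = self.encoder(self.input_proj(log_mel))
        dec_out, _ = self.decoder(self.embed(tokens))
        # Decoder states query the encoder states (cross-attention).
        attended, _ = self.cross_attn(query=dec_out, key=enc_out, value=enc_out)
        return self.out(attended)  # per-step vocabulary logits

model = TinySeq2SeqASR()
logits = model(torch.randn(2, 300, 80), torch.randint(0, 1000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 1000])
```

In a complete system, decoding runs autoregressively, one token at a time, and step 5's language model can be blended in at that point, for example via shallow fusion, where each candidate token is scored as log p_model + λ · log p_LM. Shallow fusion is named here only as one common integration method; the steps above do not prescribe a specific one.
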
Automatic Speech Recognition using Whisper

Automatic Speech Recognition (ASR) can be described simply as artificial intelligence that transforms spoken language into text. Historically, developing ASR posed significant challenges: accounting for variations in voices, accents, background noise, and speech patterns proved to be a formidable obstacle.
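
As a quick illustration, Whisper can be run in a few lines using the openai-whisper package. The audio file name below is a placeholder, and "base" is just one of several available model sizes.

```python
import whisper  # pip install openai-whisper

# Load a pretrained checkpoint (downloads weights on first use).
model = whisper.load_model("base")

# Transcribe an audio file; Whisper handles feature extraction,
# language detection, and decoding internally.
result = model.transcribe("speech.wav")
print(result["text"])
```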
