M-CTC-T Model

The M-CTC-T is a multilingual model, i.e. a single model that can transcribe audio in all 60 languages it was trained on. It is a 1-billion-parameter model.

M-CTC-T Model Architecture

Below is the model architecture of the M-CTC-T model:

[Figure: M-CTC-T model architecture]

For the M-CTC-T-Large model, the main components and their dimensions are:

  1. INPUT – The input to the encoder is a sequence of 80-dimensional log mel filterbank frames, extracted using 25 ms Hamming windows every 10 ms from the 16 kHz audio signal. The maximum sequence length is 920.
  2. CONV 1D – A single gated convolution layer (Conv1D + GLU) performs convolution along the time axis (1D). It takes the 80 input features of the log mel spectrum and converts them to 3072 output features. The filter length is 7, with a stride of 3 and valid padding. The original paper has only one convolution layer; however, the Hugging Face implementation allows multiple conv layers to be specified through the config parameter ‘num_conv_layers’. The GLU halves the output features: the convolution output is split into two halves along the feature dimension, a sigmoid activation is applied to one half, and the two halves are multiplied element-wise. Thus the GLU halves the output features to 1536.
  3. ENCODER – This consists of 36 layers, each containing:
    1. SELF ATTENTION – 4 heads of self-attention, each of size 384. Combined, the four heads take an input of 4 × 384 = 1536, which is the output size of the convolution layer after the GLU.
    2. INTERMEDIATE LAYER – The attention output is fed to an intermediate layer, a linear layer that transforms the feature vector from 1536 to 6144.
    3. FEED FORWARD LAYER – The feed-forward layer takes an input of 6144 and transforms it back to 1536. The final output of the encoder block is Batch × SeqLen × 1536.
  4. CTC Head – The CTC head is a linear layer with 8065 outputs, one for each character across all 60 languages, including punctuation, space, and the CTC blank symbol.
  5. LID Head – The LID head is a linear layer with 60 outputs, one for each language, followed by mean pooling to aggregate along the sequence length. The LID outputs are used only during training.
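The dimensions above can be traced end to end with a minimal NumPy sketch. Note this uses random stand-in tensors rather than real model weights, and only checks shapes and the Conv1D output-length arithmetic; it is not an implementation of the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 1 utterance at the maximum length of 920 frames,
# each frame an 80-dimensional log mel filterbank vector.
x = rng.standard_normal((1, 920, 80))

# Conv1D along time: filter length 7, stride 3, valid padding.
# Output length = floor((T - 7) / 3) + 1.
T_out = (920 - 7) // 3 + 1  # 305

# Stand-in for the convolution output: 3072 features per frame.
conv_out = rng.standard_normal((1, T_out, 3072))

def glu(t, axis=-1):
    """Gated Linear Unit: split features in two, gate one half with a sigmoid."""
    a, b = np.split(t, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

h = glu(conv_out)  # GLU halves the features: (1, 305, 1536)
assert h.shape == (1, T_out, 1536)

# Each of the 36 encoder layers preserves the 1536-dim width:
# 4 attention heads x 384 dims = 1536, and the feed-forward pair
# expands 1536 -> 6144 -> 1536.
assert 4 * 384 == 1536

# CTC head: a linear projection 1536 -> 8065 (characters + CTC blank).
W_ctc = rng.standard_normal((1536, 8065)) * 0.01
logits = h @ W_ctc
print(logits.shape)  # (1, 305, 8065)
```

This also makes the stride-3 downsampling concrete: 920 input frames become 305 encoder positions, so the CTC head emits one 8065-way distribution per ~30 ms of audio.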

M-CTC-T Model Training

The objective of the author while developing the M-CTC-T model was to demonstrate the use of pseudo-labeling for multilingual ASR tasks.

What is Pseudo labelling?

Pseudo-labeling is a semi-supervised technique used when we want to make use of unlabeled data. It is useful when we have a small set of labeled data along with a large set of unlabeled data. Pseudo-labeling in general consists of the following steps:

  1. INITIAL TRAINING: Train an initial baseline model on the available labeled dataset.
  2. PREDICTIONS ON UNLABELED DATA: Use the model obtained in Step 1 to predict labels for the unlabeled dataset. These predictions are known as pseudo-labels (PLs).
  3. AUGMENT TRAINING DATASET: The generated pseudo-labels are combined with the initial labeled dataset used for the baseline model. One can also choose to include only those pseudo-labels with a high prediction score/confidence.
  4. RETRAINING: Either retrain the same model or train a new model from scratch on the combined pseudo-labeled and labeled dataset.
  5. ITERATIONS: Steps 2 to 4 can be repeated to improve model robustness.
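The steps above can be sketched with a deliberately tiny example. The "model" here is just a nearest-centroid classifier on 1-D data, and the confidence score is the margin between centroid distances; both are illustrative stand-ins, not anything from the M-CTC-T paper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy two-class data: class 0 clustered near -2, class 1 near +2.
X_lab = np.concatenate([rng.normal(-2, 0.5, 20), rng.normal(2, 0.5, 20)])
y_lab = np.array([0] * 20 + [1] * 20)
X_unlab = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])

def train(X, y):
    """'Training' = computing one centroid per class."""
    return {c: X[y == c].mean() for c in np.unique(y)}

def predict_with_confidence(model, X):
    """Pseudo-labels plus a crude confidence (margin between centroid distances)."""
    d = np.stack([np.abs(X - m) for m in model.values()])
    labels = d.argmin(axis=0)
    confidence = np.abs(d[0] - d[1])  # larger margin = more confident
    return labels, confidence

model = train(X_lab, y_lab)                         # 1. initial training
pl, conf = predict_with_confidence(model, X_unlab)  # 2. predict on unlabeled data

keep = conf > 1.0                                   # 3. keep confident PLs only
X_aug = np.concatenate([X_lab, X_unlab[keep]])
y_aug = np.concatenate([y_lab, pl[keep]])

model = train(X_aug, y_aug)                         # 4. retrain on the combined set
# 5. steps 2-4 could now be repeated for further iterations
```

The confidence filter in step 3 is exactly where noisy pseudo-labels get discarded; lowering the threshold adds more data but more noise.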

One must be careful when using this technique, as too many noisy samples among the pseudo-labels will negatively impact model performance. However, this technique has been shown to improve performance in ASR tasks considerably and has gained momentum, especially in speech-related tasks.

Pseudo Label Process

The above is a general approach, and many variants exist. For the ASR task, the authors adopted the following process:

  1. Two open-source datasets were available for the ASR task:
    • CV (Common Voice), which is labeled and consists of samples from 60 languages.
    • VP (VoxPopuli), which is unlabeled and consists of samples from 23 languages. 19 languages are common between the CV and VP datasets.
  2. The authors of the M-CTC-T model followed a technique based on slimIPL (Language-Model-Free Iterative Pseudo-Labeling), developed by Facebook AI Research.
  3. The authors first trained a model for several updates on the labeled data of the CV dataset. After this step, the multilingual model is obtained.
  4. The model was then fine-tuned for a particular language for which pseudo-labels had to be generated. This fine-tuned model generated pseudo-labels for the VP dataset in the language on which it was trained. In total, 19 slimIPL models were developed to generate PLs for the 19 languages common between the CV and VP datasets.
  5. The PLs of all languages were pooled together along with the labeled data of CV, and a new model was trained from scratch. The authors found that training a new model from scratch yielded better results than continuing from the non-fine-tuned multilingual checkpoint (the model obtained after step 3, before running slimIPL).
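Since slimIPL is language-model-free, the pseudo-labels in step 4 come from the model's own CTC output, typically via greedy decoding: take the argmax symbol per frame, collapse repeats, and drop blanks. Below is a hedged sketch of that decoding rule on toy one-hot "logits" (not the authors' actual decoding code).

```python
import numpy as np

def ctc_greedy_decode(logits, blank=0):
    """LM-free greedy CTC decoding: argmax per frame,
    collapse consecutive repeats, then drop blank symbols."""
    best = logits.argmax(axis=-1)  # best symbol index per frame, shape (T,)
    out, prev = [], blank
    for t in best:
        if t != prev and t != blank:
            out.append(int(t))
        prev = t
    return out

# Toy frame-level logits over a 4-symbol vocabulary (0 = CTC blank).
# Frame argmaxes are [1, 1, 0, 2, 2, 0, 1], which collapses to [1, 2, 1].
frames = np.array([1, 1, 0, 2, 2, 0, 1])
logits = np.eye(4)[frames]  # one-hot rows acting as "logits"
print(ctc_greedy_decode(logits))  # [1, 2, 1]
```

In the real pipeline the vocabulary would be the 8065-symbol character set of the CTC head, and the decoded character sequences become the pseudo-label transcripts for the VP audio.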

M-CTC-T Model in NLP

Automatic Speech Recognition (ASR) is one of the most prominent applications of Natural Language Processing (NLP), changing the way computers interact with spoken language. This article has examined M-CTC-T, an ASR model built on Connectionist Temporal Classification (CTC) and the Transformer architecture. By covering its architecture, its training methodology, and its fine-tuning approach based on pseudo-labeling, it offers insight into the inner workings of a state-of-the-art ASR solution.
