Transformers
Thanks to its self-attention mechanism, the transformer architecture can process all parts of an input sequence in parallel, without having to step through them sequentially.
The transformer architecture has two parts: an encoder and a decoder. In the standard architecture diagram, the left-hand stack is the encoder and the right-hand stack is the decoder. If we want to build an application that translates a sentence from one language to another (say, English to French), we need both the encoder and decoder blocks. This sequence-to-sequence translation task is the original problem for which the transformer architecture was developed. Depending on the task, however, we can use only the encoder block or only the decoder block.
- For example, if we want to classify a sentence or a review as positive or negative, we need only the encoder part. The popular BERT model is encoder-based, meaning it is built using only the encoder block of the transformer architecture.
- If we want to build an application for question answering or text generation, we can use the decoder block. ChatGPT is a decoder-based model, meaning it is built using only the decoder block of the transformer architecture.
At the core of both the encoder and decoder blocks is multi-head attention. The main difference is the use of a causal mask in the decoder block, which prevents each position from attending to future positions. These attention layers let the model focus on the most relevant elements of the input sequence and down-weight the others when computing feature representations.
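To make this concrete, here is a minimal NumPy sketch of scaled dot-product attention, the operation inside each head of multi-head attention. The function name, the toy input, and the `causal` flag are illustrative choices, not part of any particular library; the flag shows how the decoder's mask blocks attention to future positions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Scaled dot-product attention, the core op inside each attention head.

    Q, K, V: arrays of shape (seq_len, d_k). With causal=True, positions
    after the current one are masked out, as in the decoder block.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq_len, seq_len) similarity scores
    if causal:
        # Each position may attend only to itself and earlier positions.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # toy sequence: 4 tokens, d_k = 8
out, w = scaled_dot_product_attention(x, x, x, causal=True)
print(np.round(w, 2))                        # upper triangle of weights is ~0
```

With `causal=False` this behaves like an encoder's self-attention (every token sees every other token); with `causal=True` the weight matrix becomes lower-triangular, which is exactly the masking the decoder block adds.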
Audio Transformer
From revolutionizing computer vision to advancing natural language processing, artificial intelligence has ventured into countless domains. Yet one domain has remained a consistent source of both fascination and complexity: audio. In the age of voice assistants, automatic speech recognition, and immersive audio experiences, the demand for robust, efficient, and scalable methods to process and understand audio data has never been higher. Enter the Audio Transformer, an architecture that bridges the gap between the visual and auditory worlds in the deep learning landscape.