Description of RAG Models
RAG models combine parametric memory (the knowledge encoded within the model parameters) with non-parametric memory (external databases or documents) to improve the model’s performance and flexibility. This hybrid approach allows the model to dynamically retrieve relevant information during the inference process, enhancing its ability to generate accurate and contextually appropriate responses.
RAG models come in two primary configurations: RAG-Sequence and RAG-Token.
RAG-Sequence
In RAG-Sequence, the model retrieves relevant documents from an external knowledge base and uses the same set of retrieved documents to generate the entire response. This method involves the following steps:
- Document Retrieval: Using a retriever to fetch documents related to the input query.
- Sequence Generation: Using a generator to produce a sequence (i.e., an entire response) conditioned on the retrieved documents.
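The two steps above can be sketched numerically. In this toy example (every probability below is made up for illustration), RAG-Sequence scores the entire output sequence against each retrieved document, then marginalizes over the documents:

```python
import numpy as np

# Toy illustration of RAG-Sequence marginalization (all numbers are made up):
#   p(y | x) = sum_z p(z | x) * prod_t p(y_t | x, z, y_<t)
# The whole output sequence is scored against each retrieved document z.

# Retrieval probabilities p(z | x) for 3 retrieved documents.
doc_probs = np.array([0.6, 0.3, 0.1])

# Per-token generator probabilities p(y_t | x, z, y_<t) for a 4-token answer,
# one row per retrieved document.
token_probs = np.array([
    [0.9, 0.8, 0.7, 0.9],   # conditioned on document 0
    [0.5, 0.6, 0.4, 0.5],   # conditioned on document 1
    [0.2, 0.3, 0.2, 0.1],   # conditioned on document 2
])

# RAG-Sequence: score the full sequence under each document, then marginalize.
seq_prob_per_doc = token_probs.prod(axis=1)          # product over tokens
p_y_given_x = (doc_probs * seq_prob_per_doc).sum()   # sum over documents

print(round(float(p_y_given_x), 4))
```

The key property is that a single document is held fixed for the whole sequence before the mixture over documents is taken.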
RAG-Token
RAG-Token operates at a finer granularity, marginalizing over the retrieved documents at each generated token rather than once per sequence. Because different tokens can draw on different documents, this token-level approach allows more granular control over response generation, potentially yielding more accurate and contextually appropriate outputs.
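A toy numerical sketch of this token-level marginalization (the probabilities below are made up for illustration):

```python
import numpy as np

# Toy illustration of RAG-Token marginalization (all numbers are made up):
#   p(y | x) = prod_t sum_z p(z | x) * p(y_t | x, z, y_<t)
# Each token's distribution is a retrieval-weighted mixture over documents,
# so different tokens can effectively draw on different documents.

doc_probs = np.array([0.6, 0.3, 0.1])   # p(z | x) for 3 retrieved documents

token_probs = np.array([                 # p(y_t | x, z, y_<t), one row per doc
    [0.9, 0.8, 0.7, 0.9],
    [0.5, 0.6, 0.4, 0.5],
    [0.2, 0.3, 0.2, 0.1],
])

# RAG-Token: marginalize over documents at every token, then multiply.
per_token_mixture = doc_probs @ token_probs   # shape (4,): one mixture per token
p_y_given_x = per_token_mixture.prod()

print(round(float(p_y_given_x), 4))
```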
Components of RAG Models
RAG models are composed of two main components:
- Retriever (DPR): Dense Passage Retrieval (DPR) is used to fetch relevant documents from a large corpus. DPR leverages bi-encoders to embed queries and documents into a shared dense vector space, facilitating efficient retrieval.
- Generator (BART): Bidirectional and Auto-Regressive Transformers (BART) is used for generating responses. BART is a denoising autoencoder for pretraining sequence-to-sequence (seq2seq) models, combining a bidirectional encoder with an autoregressive decoder.
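The dot-product scoring at the heart of DPR-style retrieval can be sketched in a few lines. The vectors below are hand-written stand-ins for the outputs of the query and passage encoders, not real DPR embeddings:

```python
import numpy as np

# Minimal sketch of DPR-style dense retrieval. In real DPR, the query and
# passage bi-encoders are trained BERT models; here, hand-written toy
# vectors stand in for their outputs so the scoring step can be shown.

# Pretend these are passage embeddings from the passage encoder (4 docs).
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
])

# Pretend this is the query embedding from the query encoder.
query_embedding = np.array([0.9, 0.1, 0.0])

# DPR scores candidates by the dot product in the shared embedding space.
scores = doc_embeddings @ query_embedding

# Return the indices of the top-k passages, best first.
k = 2
top_k = np.argsort(scores)[::-1][:k]
print(top_k.tolist())
```

In practice, the passage embeddings are precomputed and indexed (e.g. with an approximate nearest-neighbor index) so retrieval stays efficient over millions of passages.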
Training and Decoding Methodologies
RAG models are trained with a combination of supervised and unsupervised techniques: the retriever is typically initialized from a pre-trained DPR model, and the full model is then fine-tuned end-to-end on downstream tasks. During training:
- The retriever learns to fetch relevant documents by minimizing the distance between query embeddings and the embeddings of relevant documents while maximizing the distance to irrelevant ones.
- The generator is fine-tuned on the retrieved documents to produce coherent and contextually appropriate responses.
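The retriever objective described above can be sketched as a DPR-style in-batch contrastive loss. The embeddings below are toy values, and the other passages in the batch serve as the "irrelevant" negatives, a standard choice in DPR training:

```python
import numpy as np

# Sketch of a DPR-style in-batch contrastive objective. Each query is
# paired with one positive passage; the other passages in the batch act
# as negatives. The loss is the negative log-softmax of the positive
# passage's similarity score. All embeddings here are made-up toy values.

query_embs = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
])
passage_embs = np.array([   # passage i is the positive for query i
    [0.9, 0.1],
    [0.1, 0.9],
])

# Similarity matrix: sim[i, j] = dot(query_i, passage_j).
sim = query_embs @ passage_embs.T

# Softmax over passages for each query, then NLL of the diagonal (positives).
exp_sim = np.exp(sim)
softmax = exp_sim / exp_sim.sum(axis=1, keepdims=True)
loss = -np.log(np.diag(softmax)).mean()

print(round(float(loss), 4))
```

Minimizing this loss pushes each query embedding toward its positive passage and away from the in-batch negatives, which is the geometric behavior described in the bullet above.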
Decoding in RAG models involves:
- Retrieving a set of candidate documents for a given query.
- Generating responses based on these documents, either sequentially (RAG-Sequence) or token-by-token (RAG-Token).
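A minimal sketch of greedy token-by-token decoding in the RAG-Token style, using a hypothetical two-document retrieval and a hand-written table of per-document next-token distributions (both are made up for illustration):

```python
import numpy as np

# Toy greedy decoder illustrating RAG-Token-style decoding: at each step
# the next-token distribution is a retrieval-weighted mixture over the
# retrieved documents. Vocabulary, distributions, and retrieval scores
# are all hypothetical.

vocab = ["paris", "london", "is", "the", "capital", "</s>"]
doc_probs = np.array([0.7, 0.3])   # p(z | x) for 2 retrieved documents

def next_token_dist(step, doc):
    # Hypothetical per-document next-token distributions at each step.
    table = {
        (0, 0): [0.80, 0.10, 0.02, 0.02, 0.02, 0.04],
        (0, 1): [0.30, 0.50, 0.05, 0.05, 0.05, 0.05],
        (1, 0): [0.01, 0.01, 0.01, 0.01, 0.01, 0.95],
        (1, 1): [0.05, 0.05, 0.05, 0.05, 0.05, 0.75],
    }
    return np.array(table[(step, doc)])

tokens = []
for step in range(2):
    # Marginalize over documents, then pick the argmax token (greedy).
    mixture = sum(doc_probs[z] * next_token_dist(step, z) for z in range(2))
    tok = vocab[int(np.argmax(mixture))]
    tokens.append(tok)
    if tok == "</s>":
        break

print(" ".join(tokens))
```

A RAG-Sequence decoder would instead score complete candidate sequences against each document before marginalizing, rather than mixing at every step.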
Retrieval-Augmented Generation (RAG) for Knowledge-Intensive NLP Tasks
Natural language processing (NLP) has been revolutionized by pre-trained language models, which achieve state-of-the-art results on a wide range of tasks. Despite these capabilities, such models often struggle with knowledge-intensive tasks that require accessing and reasoning over explicit factual and textual knowledge.
Researchers have developed a strategy known as Retrieval-Augmented Generation (RAG) to address this limitation. In this article, we explore the limitations of pre-trained models and examine the RAG model, its configurations, and its training and decoding methodologies.