Generative Models

Data-to-text generation

It is a natural language generation technique that creates artificial text from structured or semi-structured data such as data tables, JSON documents, and knowledge graphs, or from relational databases (queried with SQL), such as PostgreSQL.

1. Soft Data-to-text generation

This technique uses a soft computing approach (a neural language model such as BERT, RoBERTa, or BART) to create sentences from the input data. It requires supervised training to map each unique type of input to semantically meaningful sentences.

When a rich language model is used, this technique can generate complex and insightful sentences. The same approach, scaled up to large language models, powers conversational AIs such as ChatGPT.

original text: Who is the president of the United States?
soft generated text: Joe Biden is currently the serving president of the United States of America.

2. Hard Data-to-text generation

This technique uses custom algorithms to create meaningful strings by preprocessing the structured/unstructured data. It relies on understanding the existing dataset completely, making manual inferences, and producing the corresponding text. The complexity of the generated text depends on the developer’s insight into the data.

This technique is usually used to verbalize data so that it is user-understandable in data warehouses, diary-entry datasets, and similar settings where the end-user has direct read access to the data.

original text: 20/03/2023 - 04/04/2023
hard generated text: From the 20th of March to the 4th of April.
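As a sketch of how such a custom algorithm might look, the date-range example above can be verbalized with plain Python (the `date_range_to_text` helper is hypothetical, written just for this illustration):

```python
from datetime import datetime

def date_range_to_text(date_range: str) -> str:
    """Turn a 'dd/mm/yyyy - dd/mm/yyyy' string into a readable sentence."""
    start_str, end_str = (part.strip() for part in date_range.split("-"))
    start = datetime.strptime(start_str, "%d/%m/%Y")
    end = datetime.strptime(end_str, "%d/%m/%Y")

    def ordinal(day: int) -> str:
        # 1 -> 1st, 2 -> 2nd, 3 -> 3rd, 4-20 -> th, 21 -> 21st, ...
        if 11 <= day % 100 <= 13:
            return f"{day}th"
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
        return f"{day}{suffix}"

    return (f"From the {ordinal(start.day)} of {start.strftime('%B')} "
            f"to the {ordinal(end.day)} of {end.strftime('%B')}.")

print(date_range_to_text("20/03/2023 - 04/04/2023"))
# From the 20th of March to the 4th of April.
```

The hard approach is deterministic and reliable, but every new input shape needs a new hand-written rule like this one.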

Text-to-text generation

1. Text summarization

This technique generates summarised sentences from documents or articles, reducing longer passages into shorter ones while keeping the semantic meaning of the original text. There are two approaches:

  1. Extractive method: This method uses a frequency-based approach to keep only those sentences containing the most frequent topic words.
  2. Abstractive method: This is a more advanced and powerful method that makes use of language models such as bidirectional transformers (BERT) and GPT.
original text: This is a geekforgeeks example. It is part of a larger example and we can probably shorten it to create a new sentence.
summarised text: We can shorten this example to create a new sentence.
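A minimal frequency-based extractive summarizer can be sketched in plain Python (a toy illustration of the extractive method; real systems add stop-word removal, stemming, and proper sentence segmentation):

```python
import re
from collections import Counter

def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Score sentences by summed word frequency; keep the top scorers in order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    # A sentence full of frequent (topic) words gets a high score.
    scores = [sum(freq[w] for w in re.findall(r"\w+", s.lower()))
              for s in sentences]
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))

text = ("Data augmentation expands a training corpus. "
        "Augmentation creates new training sentences from existing training data. "
        "It is raining outside.")
print(extractive_summary(text, num_sentences=1))
# Augmentation creates new training sentences from existing training data.
```

The sentence densest in frequent words survives; nothing new is generated, which is exactly what distinguishes the extractive method from the abstractive one.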

2. Paraphrasing

Paraphrasing is an NLP subtask used to generate semantically coherent sentences by altering words. It uses rule-based approaches, such as replacing POS-tagged words with synonyms of the same tag, as well as machine-learning-based approaches that rewrite the entire sentence without changing its meaning.

original text: This is a geekforgeeks example.
paraphrased text: This is an example from geekforgeeks.
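The rule-based side can be illustrated with a toy synonym-replacement sketch (the SYNONYMS table here is hand-written for this example; a real system would draw candidates from WordNet and check POS tags before substituting):

```python
import random

# Hand-written synonym table, for illustration only.
SYNONYMS = {
    "powerful": ["strong", "potent"],
    "example": ["illustration", "sample"],
    "quick": ["fast", "rapid"],
}

def synonym_paraphrase(sentence: str, seed: int = 0) -> str:
    """Replace each word found in SYNONYMS with a randomly chosen synonym."""
    rng = random.Random(seed)
    out = []
    for word in sentence.split():
        key = word.lower().rstrip(".,!?")
        if key in SYNONYMS:
            trailing = word[len(key):]  # preserve trailing punctuation
            out.append(rng.choice(SYNONYMS[key]) + trailing)
        else:
            out.append(word)
    return " ".join(out)

print(synonym_paraphrase("This is a quick example."))
```

Words outside the table pass through unchanged, so the sentence structure and meaning are preserved while the surface form varies.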

TextAttack library and Data Augmentation

TextAttack is a Python framework created especially for data augmentation, adversarial training, and adversarial attacks in the Natural Language Processing (NLP) domain. Only text data augmentation will be covered in this article.

Augmentation in TextAttack is performed through the Augmenter classes of the textattack framework. The six augmenters specifically designed for NLP data augmentation are:

  • WordNetAugmenter
  • EmbeddingAugmenter
  • CharSwapAugmenter
  • CheckListAugmenter
  • EasyDataAugmenter
  • CLAREAugmenter

To install TextAttack:

!pip install textattack

WordNetAugmenter

The WordNetAugmenter in TextAttack utilizes WordNet, a lexical database of the English language, for word substitution-based data augmentation. It replaces words in a sentence with their synonyms or hypernyms, enhancing text diversity.

Python3

from textattack.augmentation import WordNetAugmenter
 
augmenter = WordNetAugmenter()
 
# Example usage:
sentence = "The quick brown fox jumps over the lazy dog."
augmented_sentence = augmenter.augment(sentence)
 
print(f"Original Sentence: {sentence}")
print(f"Augmented Sentence: {augmented_sentence}")


Output:

Original Sentence: The quick brown fox jumps over the lazy dog.
Augmented Sentence: ['The quick brown fox jumps over the lazy click.']

EmbeddingAugmenter

The EmbeddingAugmenter in TextAttack performs data augmentation by replacing words with their nearest neighbours in a word-embedding space. This varies the text while preserving semantic meaning.

Python3

from textattack.augmentation import EmbeddingAugmenter
 
# Initialize the EmbeddingAugmenter
embed_aug = EmbeddingAugmenter()
 
# Example usage:
original_text = "TextAttack is a powerful library for NLP."
augmented_text = embed_aug.augment(original_text)
 
print(f"Original Text: {original_text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['TextAttack is a emphatic library for NLP.']

CharSwapAugmenter

The CharSwapAugmenter in TextAttack is an augmentation technique that randomly swaps adjacent characters in a word to introduce small, character-level perturbations. This can help improve the robustness of natural language processing models by simulating variations in the input text at the character level.

Python3

from textattack.augmentation import CharSwapAugmenter
 
# Initialize the CharSwapAugmenter
char_swap_augmenter = CharSwapAugmenter()
 
# Example usage:
original_text = "TextAttack is a powerful library for NLP."
augmented_text = char_swap_augmenter.augment(original_text)
 
print(f"Original Text: {original_text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['TextAttack is a powerqul library for NLP.']

CheckListAugmenter

The CheckListAugmenter in TextAttack is an augmentation technique that uses pre-defined transformations to generate perturbed versions of the input text. It leverages the CheckList library’s transformations, a collection of linguistic perturbations (such as contracting or expanding phrases and swapping named entities) used to test and improve model robustness. If none of the transformations apply, the text is returned unchanged.

Python3

from textattack.augmentation import CheckListAugmenter
 
# Sample text
text = "TextAttack is a powerful library for NLP."
 
# Initialize the CheckListAugmenter
checklist_augmenter = CheckListAugmenter()
 
# Apply CheckList transformations
augmented_text = checklist_augmenter.augment(text)
 
# Print the results
print(f"Original Text: {text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['TextAttack is a powerful library for NLP.']

EasyDataAugmenter

The EasyDataAugmenter applies the Easy Data Augmentation (EDA) operations: synonym replacement, random word insertion, random word swap, and random word deletion.

Python3

from textattack.augmentation import EasyDataAugmenter
 
# Example text
text = "TextAttack is a powerful library for NLP."
 
# Initialize the EasyDataAugmenter
eda_augmenter = EasyDataAugmenter()
 
# Apply EasyDataAugmenter for text augmentation
augmented_text = eda_augmenter.augment(text)
 
# Print the results
print(f"Original Text: {text}")
print(f"Augmented Text: {augmented_text}")


Output:

Original Text: TextAttack is a powerful library for NLP.
Augmented Text: ['NLP is a powerful library for TextAttack.',
'TextAttack is a potent library for NLP.',
'TextAttack is a powerful for NLP.',
'ampere TextAttack is a powerful library for NLP.']

CLAREAugmenter

It enhances text using a pre-trained masked language model, applying replacement, insertion, and merge operations on tokens.

Back Translation

The back translation technique is used when the required data is present in a different language. Documents in the source language are translated to the target language using machine translation models. A major drawback of this method is that the meaning of language-specific words is sometimes lost during translation, so the corpus to be translated should contain simple words that can be translated easily and accurately.

original text: This is a geek for geeks example.
translated text: यह गीक्स उदाहरण के लिए एक गीक है।

Back Transliteration 

Back transliteration is a technique used to generate sentences or phrases that sound phonetically similar to the source language. This is useful for generating training data for classification tasks involving localized or bilingual phrases where the target language is a low-resource language, i.e., one with fewer data sources.

original text: This is a geek for geeks example.
transliterated text: थिस इस अ गैक फ़ोर गैक्स ए‍अम्प्ले

Advantages of Data Augmentation

Data augmentation provides a number of benefits in natural language processing (NLP), enhancing model robustness and performance:

  • Increased Data Diversity: Data augmentation introduces variations in the input data by creating diverse instances of the original data. This helps expose the model to a broader range of linguistic patterns and variations.
  • Improved Generalization: Augmented data aids in enhancing the generalization ability of NLP models. By presenting the model with a more extensive and varied dataset during training, it learns to handle a wider array of scenarios, leading to better performance on unseen data.
  • Addressing Data Scarcity: In many NLP tasks, obtaining a large labeled dataset can be challenging. Data augmentation mitigates the issue of data scarcity by artificially expanding the dataset, allowing models to be trained on a more substantial amount of data.

Disadvantages of Data Augmentation

Although data augmentation in NLP has a number of benefits, there are also some drawbacks to keep in mind:

  • Risk of Introducing Unintended Biases: Data augmentation methods may inadvertently introduce biases into the augmented data, potentially leading to biased model predictions. Careful consideration is needed to ensure that augmented samples do not reinforce existing biases or introduce new ones.
  • Potential Overfitting to Augmented Patterns: If not carefully controlled, models might overfit to the specific patterns introduced by data augmentation, rather than learning more generalizable features. This can occur if augmentation is applied excessively or without proper validation.
  • Increased Computational Complexity: Augmenting data increases the computational requirements during training, as the model needs to process a larger amount of augmented data. This can lead to longer training times and increased resource consumption.

Text augmentation techniques in NLP

Text augmentation is an important aspect of NLP used to generate an artificial corpus. It helps NLP-based models generalize better over many different sub-tasks like intent classification, machine translation, chatbot training, image summarization, etc.

Text augmentation is used when:

  • There is an absence of sufficient variation in the text corpus.
  • There is a high data imbalance during intent classification tasks.
  • The overall quantity of data is insufficient for data-hungry machine-learning models.  
