Dataset for Sentence Autocomplete Model

We will use a Wikipedia dataset that we can download from here. The main problem with this dataset is that it contains special characters, non-meaningful words, and unknown words that we cannot feed directly into our model, so we must clean it before training. Since the dataset is an MS Word document, we will use the python-docx library to read it. The library can be installed with the following command.

!pip install python-docx

Python Code for Cleaning the Dataset

We will use Python's re module to remove square brackets and any text enclosed in them. Since string comparison in Python is case-sensitive, we will also convert all words to lowercase so that, for example, "The" and "the" are treated as the same word. Because we are developing this model for English-speaking audiences only, we will remove non-English words.

Python3

import re 
import string 
import torch 
import pandas as pd 
from docx import Document 
  
# Read the DOCX file 
doc_path = "wikipedia.docx" 
doc = Document(doc_path) 
  
# Extract text from paragraphs 
text_data = [paragraph.text for paragraph in doc.paragraphs] 
  
# Convert text to lowercase 
text_data = [text.lower() for text in text_data]
  
# Remove special characters and words between them using regex 
text_data = [re.sub(r"\[.*?\]", "", text) for text in text_data] 
  
# Remove words not in the English alphabet 
english_alphabet = set(string.ascii_lowercase) 
text_data = [
    ' '.join(word for word in text.split()
             if all(char in english_alphabet for char in word))
    for text in text_data
]
  
# Remove leading/trailing whitespaces 
text_data = [text.strip() for text in text_data]
  
# Remove empty sentences 
text_data = [text for text in text_data if text]
  
# Create a DataFrame with the cleaned text data 
df = pd.DataFrame({"Text": text_data}) 
  
# Save the cleaned text data to a CSV file 
output_path = "output.csv" 
# Set index=False to exclude the index column in the output 
df.to_csv(output_path, index=False)   
  
print("Text data cleaned and saved to:", output_path) 

Output:

Text data cleaned and saved to: output.csv
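
To see what each cleaning step does, here is a minimal sketch that applies the same steps to two in-memory strings instead of the DOCX file. The function name clean_paragraphs and the sample sentences are assumptions made for illustration; they are not part of the code above.

```python
import re
import string

ENGLISH_ALPHABET = set(string.ascii_lowercase)


def clean_paragraphs(paragraphs):
    """Apply the cleaning steps from this section to a list of strings.

    Hypothetical helper, used here only to illustrate the steps.
    """
    cleaned = []
    for text in paragraphs:
        # Convert to lowercase
        text = text.lower()
        # Drop square brackets and whatever they enclose, e.g. "[1]"
        text = re.sub(r"\[.*?\]", "", text)
        # Keep only words made up entirely of the letters a-z
        words = [w for w in text.split()
                 if all(c in ENGLISH_ALPHABET for c in w)]
        text = " ".join(words).strip()
        # Skip paragraphs that end up empty
        if text:
            cleaned.append(text)
    return cleaned


sample = ["Python is a Language[1] created in 1991", "  [edit]  "]
print(clean_paragraphs(sample))
# ['python is a language created in']
```

Note that the alphabet filter is strict: it also discards any word with attached punctuation or digits, so a word such as "model." at the end of a sentence is removed along with "1991".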

This code produces a CSV file containing the cleaned dataset, which we can use to train our model. The cleaned CSV file can also be downloaded from here.
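
Before training, it can be worth checking that the CSV reads back as expected. The following is a minimal sketch using only the standard library; the helper name load_sentences and the small stand-in file are assumptions, used so that the example runs without the real output.csv. The column name "Text" matches the DataFrame created above.

```python
import csv
import os
import tempfile


def load_sentences(csv_path, column="Text"):
    """Return the non-empty values of one CSV column as a list (hypothetical helper)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f) if row[column]]


# Small stand-in file so the sketch runs without the real output.csv
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerows([["Text"],
                      ["first cleaned sentence"],
                      ["second cleaned sentence"]])

sentences = load_sentences(demo_path)
print(len(sentences), "sentences loaded")
```

With the real file, the call would be load_sentences("output.csv").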

Sentence Autocomplete Using PyTorch

Natural Language Processing (NLP) is one of the most flourishing areas of deep learning, and several NLP applications are in constant daily use. In this article, we will see how to use deep learning to autocomplete half-written sentences. We will also see how to generate clean data for training our NLP model. We will cover the following steps in this article:

Cleaning the text data for training the NLP model
Loading the dataset using PyTorch
Creating the LSTM model
Training an NLP model
Making inferences from the trained model

We have all seen applications such as Google Keyboard, which recommends what to type next based on the words already written in the chat box draft. However, to recommend the next word, an application like Google's has been trained on billions of written sentences. For our model, we will use Wikipedia sentences, which are freely available to download from the internet, as training data.
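
To make the idea of recommending the next word concrete, here is a minimal sketch of how one sentence can be turned into (context, next word) training examples. This is an illustration under stated assumptions, not necessarily the data preparation used later in this article: the function name next_word_pairs, the whitespace tokenisation, and the context_size parameter are all hypothetical.

```python
def next_word_pairs(sentence, context_size=3):
    """Return (context, next_word) pairs from one whitespace-tokenised sentence.

    Each pair holds up to `context_size` preceding words and the word
    that follows them.
    """
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[max(0, i - context_size):i]
        pairs.append((context, words[i]))
    return pairs


for context, target in next_word_pairs("the cat sat on the mat"):
    print(context, "->", target)
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ['the', 'cat', 'sat'] -> on
# ['cat', 'sat', 'on'] -> the
# ['sat', 'on', 'the'] -> mat
```

A model trained on many such pairs learns to predict a likely next word for a given context, which is what an autocomplete feature needs.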
