Text Cleaning

Sometimes the data we acquire is not very clean. It may contain HTML tags, spelling mistakes, or special characters. Let's look at some techniques for cleaning text data.

Unicode Normalization: Text data may contain symbols, emojis, graphic characters, or other special characters. We can either remove these characters or convert them to a machine-readable form, such as their UTF-8 byte encoding.

Python3

# Unicode normalization: encode text containing emojis and
# non-Latin characters as UTF-8 bytes
text = "w3wiki 😀"
print(text.encode('utf-8'))

text1 = 'गीक्स फॉर गीक्स 😀'
print(text1.encode('utf-8'))

Output:

b'w3wiki \xf0\x9f\x98\x80'
b'\xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 \xe0\xa4\xab\xe0\xa5\x89\xe0\xa4\xb0 \xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 \xf0\x9f\x98\x80'
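
The snippet above only encodes the text as UTF-8 bytes; the characters themselves are unchanged. If the goal is to fold compatibility characters into a standard form or to drop non-ASCII characters, a minimal sketch using Python's standard-library unicodedata module might look like the following. The helper names normalize_text and strip_non_ascii are hypothetical, and the choice of the NFKC form is an assumption; NFC, NFD, or NFKD may suit other data better.

```python
import unicodedata

def normalize_text(text, form="NFKC"):
    """Return text in the given Unicode normalization form."""
    return unicodedata.normalize(form, text)

def strip_non_ascii(text):
    """Drop every character that cannot be reduced to ASCII."""
    # NFKD splits accented letters into base letter + combining mark,
    # so the base letter survives the ASCII round trip.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# The "ﬁ" ligature and the circled digit are folded to plain characters.
print(normalize_text("ﬁne ①"))
# The accent and the emoji are removed.
print(strip_non_ascii("café 😀"))
```

Note that strip_non_ascii would delete the Hindi example text entirely, so this kind of removal is only reasonable when the pipeline is meant to keep English text alone.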

Regex or Regular Expression: A regular expression is a tool for searching a string for specific patterns. Suppose our data contains phone numbers, email IDs, and URLs. We can find such text using regular expressions and then either keep or remove the matched patterns, as required.

Spelling corrections: Spelling mistakes are very common when data is extracted from social media. To overcome this problem, we can create a corpus or dictionary of the most commonly mistyped words and replace each mistake with the correct word.

Python3

import re

text = """<gfg>  
#GFG Geeks Learning together  
url <https://www.w3wiki.org/>,  
email <acs@sdf.dv> 
"""

def clean_text(text):
    # Strip markup and special characters: < > , # * ?
    # (only the characters are removed; the tag name "gfg" is kept)
    special = re.compile(r'[<,#*?>]')
    text = special.sub('', text)
    # Remove URLs
    url = re.compile(r'https?://\S+|www\.\S+')
    text = url.sub('', text)
    # Remove email IDs
    email = re.compile(r'[A-Za-z0-9]+@\w+\.\w+')
    text = email.sub('', text)
    return text

print(clean_text(text))

Output:

gfg 
GFG Geeks Learning together 
url  
email
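
For the spelling-correction step described earlier, a minimal sketch of the dictionary approach could look like this. The CORRECTIONS table and the correct_spelling function are hypothetical names introduced for illustration, and the three entries are sample values, not a real corpus of common mistakes.

```python
import re

# Hypothetical lookup table: commonly mistyped word -> correct word.
CORRECTIONS = {
    "teh": "the",
    "recieve": "receive",
    "definately": "definitely",
}

def correct_spelling(text, corrections=CORRECTIONS):
    """Replace known misspellings, matching whole words only."""
    def fix(match):
        word = match.group(0)
        fixed = corrections.get(word.lower())
        if fixed is None:
            return word  # not a known mistake, leave it untouched
        # Keep a leading capital if the original word had one.
        return fixed.capitalize() if word[0].isupper() else fixed
    return re.sub(r"[A-Za-z]+", fix, text)

print(correct_spelling("Teh parcel will definately arrive, did you recieve it?"))
```

Matching whole words avoids rewriting fragments inside longer words, and a plain dictionary lookup only fixes mistakes that were listed in advance.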

Natural Language Processing (NLP) Pipeline

Natural Language Processing (NLP) is a subfield of artificial intelligence that enables machines to comprehend and analyze human languages, which may be represented as text or audio.

The NLP pipeline is the sequence of processes involved in analyzing and understanding human language. The basic processes are the same across NLP tasks, and this article discusses some of the most common approaches used when processing text data.
