Text Cleaning
Sometimes the data we acquire is not very clean. It may contain HTML tags, spelling mistakes, or special characters. Let's look at some techniques for cleaning text data.
- Unicode Normalization: Text data may contain symbols, emojis, graphic characters, or other special characters. We can either remove these characters or convert them to a machine-readable form, for example by encoding the text as UTF-8 bytes.
Python3
# Unicode Normalization
text = "w3wiki 😀"
print(text.encode('utf-8'))

text1 = 'गीक्स फॉर गीक्स 😀'
print(text1.encode('utf-8'))
Output:
b'w3wiki \xf0\x9f\x98\x80'
b'\xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 \xe0\xa4\xab\xe0\xa5\x89\xe0\xa4\xb0 \xe0\xa4\x97\xe0\xa5\x80\xe0\xa4\x95\xe0\xa5\x8d\xe0\xa4\xb8 \xf0\x9f\x98\x80'
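Encoding to UTF-8 makes the text machine-readable but does not change its characters. As a complementary sketch, the standard-library `unicodedata` module can fold visually similar characters into a canonical form, and an ASCII round trip can drop characters such as emojis. The helper names `normalize_text` and `strip_non_ascii` are illustrative assumptions, not part of the original example.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold compatibility characters (ligatures, fullwidth letters, circled digits) with NFKC."""
    return unicodedata.normalize("NFKC", text)


def strip_non_ascii(text: str) -> str:
    """Decompose accented letters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(normalize_text("ﬁne ① ｆｕｌｌ"))  # fine 1 full
print(repr(strip_non_ascii("café 😀")))   # 'cafe '
```

Note that `strip_non_ascii` would also delete non-Latin scripts such as the Hindi text above, so it only suits data that is expected to be plain English.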
- Regex (Regular Expression): A regular expression is a tool for searching strings for specific patterns. Suppose our data contains phone numbers, email IDs, and URLs. We can find such text using regular expressions and then either keep or remove the matched patterns, as required.
- Spelling correction: When data is extracted from social media, spelling mistakes are very common. To overcome this problem, we can create a corpus or dictionary of the most commonly mistyped words and replace each of them with the correct word.
Python3
import re

text = """<gfg> #GFG Geeks Learning together url <https://www.w3wiki.org/>, email <acs@sdf.dv> """

def clean_text(text):
    # Remove tag brackets and other special characters (the tag name itself is kept)
    special_chars = re.compile(r'[<,#*?>]')
    text = special_chars.sub('', text)
    # Remove URLs
    url = re.compile(r'https?://\S+|www\.\S+')
    text = url.sub('', text)
    # Remove email addresses
    email = re.compile(r'[A-Za-z0-9]+@\w+\.\w+')
    text = email.sub('', text)
    return text

print(clean_text(text))
Output:
gfg GFG Geeks Learning together url email
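The spelling-correction approach described earlier, a dictionary of commonly mistyped words, can be sketched as follows. The `COMMON_MISSPELLINGS` table and the `correct_spelling` helper are hypothetical names with made-up entries. A real table would be built from the mistakes observed in your own data.

```python
import re

# Hypothetical lookup table: misspelling -> correct word (assumed entries for illustration).
COMMON_MISSPELLINGS = {
    "teh": "the",
    "recieve": "receive",
    "definately": "definitely",
    "u": "you",
}


def correct_spelling(text, corrections=COMMON_MISSPELLINGS):
    """Replace each whole word found in the corrections table, keeping initial capitalization."""

    def replace(match):
        word = match.group(0)
        fixed = corrections.get(word.lower())
        if fixed is None:
            return word  # not a known misspelling, leave unchanged
        return fixed.capitalize() if word[0].isupper() else fixed

    # Match maximal runs of letters so that "u" is only replaced when it stands alone.
    return re.sub(r"[A-Za-z]+", replace, text)


print(correct_spelling("Teh package will definately arrive, did u recieve it?"))
# The package will definitely arrive, did you receive it?
```

A lookup table only fixes the mistakes it lists, so it works best for a small set of frequent errors.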
Natural Language Processing (NLP) Pipeline
Natural Language Processing (NLP) is a subfield of artificial intelligence that enables machines to comprehend and analyze human language. Human language can be represented as text or audio.
The natural language processing (NLP) pipeline refers to the sequence of processes involved in analyzing and understanding human language. The following is a typical NLP pipeline:
The basic processes for all the above tasks are the same. Here we have discussed some of the most common approaches which are used during the processing of text data.