XML encoding methods
An XML document can be encoded in any of the formats listed below.
- UTF-8
- UTF-16
- Latin1
- US-ASCII
- ISO-8859-1 to ISO-8859-10
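Among these, UTF-8 is the most common. UTF-16 uses two bytes for most characters (four for characters outside the Basic Multilingual Plane), and a UTF-16 document usually begins with a byte order mark (0xFEFF). Latin1, which is another name for ISO-8859-1, covers Western European characters.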
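To make the difference between these encodings concrete, the short sketch below encodes the same word with several of them and compares the resulting bytes. It uses only Python's built-in codecs; the sample word "café" is an arbitrary choice for illustration.

```python
# Encode the same text with several of the encodings listed above
# and compare the resulting byte sequences.
text = "café"

utf8 = text.encode("utf-8")       # 'é' takes two bytes in UTF-8
utf16 = text.encode("utf-16")     # Python prepends a byte order mark (BOM)
latin1 = text.encode("latin-1")   # one byte per character

print(len(utf8), len(utf16), len(latin1))

# The BOM is FF FE or FE FF, depending on the machine's byte order.
has_bom = utf16[:2] in (b"\xff\xfe", b"\xfe\xff")
print(has_bom)

# "latin-1" and "iso-8859-1" are two names for the same codec.
same_codec = latin1 == text.encode("iso-8859-1")
print(same_codec)

# US-ASCII cannot represent 'é' at all.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

The byte counts differ because UTF-8 is variable-width, UTF-16 spends two bytes on each of these characters plus two on the BOM, and Latin1 always uses one byte per character.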
Encoding in BeautifulSoup
Character encoding plays a major role in how the content of an HTML or XML document is interpreted. A document may contain not only English characters but also non-English characters such as Hebrew, Latin, Greek and many more. To tell the parser which encoding to use, a document carries a dedicated tag and attribute that specify it. For example:
In HTML documents
<meta charset="--encoding method name--">
In XML documents
<?xml version="1.0" encoding="--encoding method name--"?>
These declarations tell the browser which encoding to use when parsing the document. If the correct encoding is not specified, the content is rendered incorrectly, sometimes with the replacement character ‘�’ in place of the characters that could not be decoded.
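The effect of choosing the wrong encoding can be reproduced with plain Python, without BeautifulSoup. The sketch below is only an illustration: the HTML fragment and the variable names are made up for this example, and the bytes are decoded by hand rather than by a real parser.

```python
# Bytes of an HTML fragment that was saved as ISO-8859-1 (Latin1).
raw = "<p>café</p>".encode("iso-8859-1")

# Decoding with the encoding the document was written in gives the right text.
correct = raw.decode("iso-8859-1")
print(correct)

# Decoding the same bytes as UTF-8 hits the invalid byte 0xE9;
# errors="replace" substitutes the replacement character U+FFFD for it.
replaced = raw.decode("utf-8", errors="replace")
print(replaced)

# The reverse mistake: UTF-8 bytes read as Latin1 decode without an error,
# but the two bytes of 'é' show up as two wrong characters.
garbled = "café".encode("utf-8").decode("iso-8859-1")
print(garbled)
```

Both failure modes are silent in the sense that no exception is raised, which is why a correct encoding declaration in the document matters.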