XML encoding methods

XML documents can be encoded in any of the formats listed below.

  • UTF-8 
  • UTF-16
  • Latin1 (an alias for ISO-8859-1)
  • US-ASCII
  • ISO-8859-1 to ISO-8859-10

Of these, UTF-8 is the most common. UTF-16 uses two or four bytes per character, and documents encoded this way usually begin with a byte order mark (the bytes 0xFE 0xFF or 0xFF 0xFE). Latin1 covers Western European characters and uses one byte per character.
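As a rough illustration of how these encodings differ in size, the sketch below encodes the same character with Python's built-in codecs. The sample character and the variable names are chosen for illustration only.

```python
import codecs

text = "é"  # U+00E9, a Western European character present in Latin1

# Encode the same character with three of the encodings listed above.
utf8_bytes = text.encode("utf-8")         # two bytes in UTF-8
utf16_bytes = text.encode("utf-16")       # byte order mark + one 16-bit unit
utf16_le_bytes = text.encode("utf-16-le") # explicit byte order, no mark
latin1_bytes = text.encode("latin1")      # a single byte

for name, data in [("utf-8", utf8_bytes), ("utf-16", utf16_bytes),
                   ("utf-16-le", utf16_le_bytes), ("latin1", latin1_bytes)]:
    print(name, len(data), data)

# Python's generic "utf-16" codec prepends a byte order mark.
has_bom = utf16_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# Characters outside the Basic Multilingual Plane need four bytes in UTF-16.
emoji_len = len("\U0001F600".encode("utf-16-le"))
```

The byte order mark is what lets a parser recognise a UTF-16 document before it has read any declaration.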

Encoding in BeautifulSoup

Character encoding plays a major role in how the content of an HTML or XML document is interpreted. A document may contain not only English characters but also characters from other scripts such as Hebrew, Latin, Greek and many more. To tell the parser which encoding to use, a document carries a dedicated tag or attribute that names it. For example:

In HTML documents

<meta charset="--encoding method name--">

In XML documents

<?xml version="1.0" encoding="--encoding method name--"?>

These declarations tell the browser or parser which encoding to use when reading the document. If the correct encoding is not specified, the content is rendered incorrectly, sometimes with the replacement character ‘�’ in place of the characters that could not be decoded.
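The sketch below shows one way this plays out with BeautifulSoup. It assumes the bs4 package is installed and uses a made-up sample document that declares ISO-8859-1. The from_encoding argument and the original_encoding attribute are part of bs4; everything else is illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document: it declares ISO-8859-1 and is encoded that way.
html_bytes = (
    '<html><head><meta charset="iso-8859-1"></head>'
    "<body><p>café</p></body></html>"
).encode("iso-8859-1")

# Let BeautifulSoup work out the encoding from the bytes and the <meta> tag.
detected = BeautifulSoup(html_bytes, "html.parser")
print(detected.original_encoding)  # the encoding bs4 settled on
print(detected.p.get_text())

# If the encoding is already known, pass it explicitly.
explicit = BeautifulSoup(html_bytes, "html.parser", from_encoding="iso-8859-1")
explicit_text = explicit.p.get_text()
print(explicit_text)

# Decoding the same bytes with the wrong encoding shows the failure mode:
# the byte that is invalid in UTF-8 becomes the replacement character.
wrong = html_bytes.decode("utf-8", errors="replace")
print(wrong)
```

Passing from_encoding is mainly useful when a document has no declaration or declares the wrong encoding, since automatic detection can then guess incorrectly.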
