How to retain uniqueness of compressed text?

During the encoding process in compression, every character is assigned a variable-length binary code. The problem with this approach lies in decoding: if the code of one character is a prefix of another’s, the decoder cannot tell where one code ends and the next begins. Hence, the “prefix rule” is used, which ensures that the algorithm generates only uniquely decodable codes. Under this rule, no code is a prefix of any other, and the ambiguity is resolved.
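
To see why the prefix rule matters, here is a minimal sketch (with hypothetical hand-picked codes, not produced by any particular algorithm) that decodes a bit string greedily against two code tables: one where a code is a prefix of another, and one that is prefix-free.

```python
# Hypothetical code tables for illustration only.
ambiguous = {"a": "0", "b": "01", "c": "1"}     # "a" (0) is a prefix of "b" (01)
prefix_free = {"a": "0", "b": "10", "c": "11"}  # no code is a prefix of another

def decode(bits, codes):
    """Greedy shortest-match decoding: emit a character as soon as the
    accumulated bits form a complete code."""
    lookup = {code: char for char, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in lookup:        # a complete code has been read
            out.append(lookup[buffer])
            buffer = ""
    return "".join(out)

# "01" was encoded as "b", but greedy decoding reads it as "a" then "c":
print(decode("01", ambiguous))       # -> "ac", not "b": the stream is ambiguous
# With the prefix-free table, "10011" has exactly one reading:
print(decode("10011", prefix_free))  # -> "bac"
```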

Hence, for text file compression in this article, we use an algorithm that provides lossless compression and applies variable-length encoding under the prefix rule. The article also covers regenerating the original file through the decoding process.

Text File Compression And Decompression Using Huffman Coding

Text files can be compressed to make them smaller and faster to send, and unzipping them on a device carries little overhead. Encoding changes the representation of a file so that the compressed (binary) output takes less space to store and less time to transmit, while retaining the ability to reconstruct the original file exactly from its compressed form. Text files can be of various types, such as HTML, JavaScript, CSS, .txt, and so on. Text compression is needed because uncompressed data can take up a lot of space, which is inconvenient for device storage and file sharing.

How Does The Process Of Compression Work?

The size of a text file can be reduced by compressing it, which converts the text to a smaller format that takes up less space. Compression typically works by locating repeated strings or characters within a text file and replacing them with a shorter binary representation to reduce the overall file size. There are two types of file compression,...
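
As a concrete (hypothetical) illustration of the size reduction, the sketch below counts character frequencies and compares the cost of fixed 8-bit codes against hand-picked variable-length prefix codes in which frequent characters get shorter codes.

```python
from collections import Counter

text = "aaaaabbbc"
freq = Counter(text)                        # {'a': 5, 'b': 3, 'c': 1}

# Hand-picked prefix-free codes for illustration; a real compressor
# (e.g. Huffman coding, below) derives these from the frequencies.
codes = {"a": "0", "b": "10", "c": "11"}

fixed_bits = len(text) * 8                  # plain 8-bit encoding: 72 bits
variable_bits = sum(freq[ch] * len(codes[ch]) for ch in freq)  # 13 bits
print(fixed_bits, variable_bits)
```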

Compressing a Text File:

We use the Huffman Coding algorithm for this purpose. It is a greedy algorithm that assigns a variable-length binary code to each input character in the text file, where the length of a character’s code depends on its frequency in the file: the more frequent the character, the shorter its code. The algorithm builds a binary tree in which all the unique characters of the file are stored in the tree’s leaf nodes....
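
A minimal sketch of this greedy construction is shown below; the Node class and helper names are illustrative rather than taken from any reference implementation. The two least frequent subtrees are repeatedly merged until one tree remains, and each character’s code is the 0/1 path from the root to its leaf.

```python
import heapq
from collections import Counter

class Node:
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char = freq, char
        self.left, self.right = left, right
    def __lt__(self, other):                 # ordering required by heapq
        return self.freq < other.freq

def build_tree(text):
    heap = [Node(f, c) for c, f in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:                     # greedily merge the two rarest
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        heapq.heappush(heap, Node(a.freq + b.freq, left=a, right=b))
    return heap[0]

def assign_codes(node, prefix="", table=None):
    table = {} if table is None else table
    if node.char is not None:                # leaf: the path so far is the code
        table[node.char] = prefix or "0"     # lone-character edge case
    else:
        assign_codes(node.left, prefix + "0", table)
        assign_codes(node.right, prefix + "1", table)
    return table

codes = assign_codes(build_tree("this is an example"))
```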

Compressed File Structure:

We’ve talked about generating variable-length codes and replacing the file’s original characters with them so far. However, this only compresses the file. The more difficult task is to decompress the file by decoding the binary codes back to their original values....
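
One possible layout, sketched below under assumptions of our own (single-byte characters and at most 255 unique characters; the exact format is a design choice, not prescribed here), is a header carrying the character-to-code table followed by the padded bit stream.

```python
def pack_bits(bitstring):
    """Pad the bit string to a whole number of bytes; return (padding, bytes)."""
    padding = (8 - len(bitstring) % 8) % 8
    bitstring += "0" * padding
    data = bytes(int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8))
    return padding, data

def write_compressed(path, codes, text):
    bits = "".join(codes[ch] for ch in text)
    padding, payload = pack_bits(bits)
    with open(path, "wb") as f:
        f.write(bytes([len(codes)]))              # header: number of unique chars
        for ch, code in codes.items():            # header: the code table
            f.write(bytes([ord(ch), len(code)]))  # assumes single-byte characters
            f.write(code.encode("ascii"))
        f.write(bytes([padding]))                 # number of padding bits at the end
        f.write(payload)                          # the encoded text itself
```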

Decompressing the Compressed File:

1. The compressed file is opened, and the number of unique characters and the total number of characters in the file are retrieved.
2. The characters and their binary codes are then read from the file. We can recreate the Huffman tree using these.
3. For each binary code: a left edge is created for 0 and a right edge is created for 1. Finally, a leaf node is formed and the character is stored within it. This is repeated for all characters and binary codes, recreating the Huffman tree.
4. The remaining file is now read bit by bit, and the corresponding 0/1 edge in the tree is traversed. As soon as a leaf node is reached, the corresponding character is written into the decompressed file.
5. Step 4 is repeated until the compressed file has been read completely....
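
The sketch below mirrors these steps in memory, assuming the code table has already been read from the header as described in steps 1-2; the nested-dict tree is an illustrative stand-in for explicit left/right child nodes.

```python
def rebuild_tree(codes):
    root = {}                                # nested dicts stand in for tree nodes
    for char, code in codes.items():
        node = root
        for bit in code[:-1]:                # walk/create 0 (left) and 1 (right) edges
            node = node.setdefault(bit, {})
        node[code[-1]] = char                # the final edge leads to a leaf
    return root

def decode_bits(bits, root):
    out, node = [], root
    for bit in bits:
        node = node[bit]                     # follow the 0/1 edge
        if isinstance(node, str):            # leaf reached: emit its character
            out.append(node)
            node = root                      # restart from the root
    return "".join(out)

codes = {"a": "0", "b": "10", "c": "11"}     # as if read from the file header
print(decode_bits("10011", rebuild_tree(codes)))  # -> "bac"
```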