Friday, May 16, 2014

Huffman's encoding

Huffman's encoding is an algorithm that, given a file to encode that consists entirely of all of the letters present in an alphabet alpha, and the frequencies of each letter appears in the file for all of the letters in alpha, returns a binary search tree that gives a prefix-free encoding that is guaranteed to use minimal space.

What does "prefix-free" mean? Let's try to encode two letters, "a" and "b" as an example. Say that "a" has an encoding of "00", and "b" has an encoding of "0". Then if I have a string I want to decode that is "00000", how do I know which parts of this string are "a"s, and which parts of this string are "b"s? For this example encoding, one would not be able to tell where one letter ends and another begins without a prefix. Note that we wouldn't have this encoding problem if every single letter in the alphabet had the same length encoding, but that we wouldn't be able to get a minimal encoding taking advantage of the alphabet frequencies we know, if everything was of the same length.

Huffman's encoding first puts all of the letters into a priority queue, prioritizing by the frequencies in an ascending manner, with the two letters with the two smallest frequencies at the head of the queue. Then these first two minimum frequency letters are dequeued and a new parent "node" is created from those two letters, with the two letters as children, and with the node's frequency equal to the sum of the two letters' frequencies. This new parent node is then inserted back into the priority queue. Note that this reinsertion is done on a priority queue, and not just a queue, so the ascending order still holds true. Repeat this tree generation until only the root node is left in the priority queue. This results in the root node consisting of a traversable binary search tree that can be used for decoding.



Dasgupta, S., & Papadimitriou, C. H. (2008). Algorithms. Boston: McGraw-Hill Higher Education.

No comments:

Post a Comment