- Huffman Issues:
- Is the code correct?
- Does it satisfy the prefix property?
- Does it give good compression?
- How to decode?
- How to encode?
- How to determine weights/frequencies?
Huffman Compression - Is the code correct?
- Based on the way the tree is formed, it is clear that the codewords are valid
- Prefix Property is assured, since each codeword ends at a leaf
- all original nodes corresponding to the characters end up as leaves
- Does it give good compression?
- For a block code of N different characters, ⌈log2 N⌉ bits are needed per character
Huffman Compression - Given Huffman codes {C0,C1,…CN-1} for the N characters in the alphabet, each of length |Ci|
- Given frequencies {F0,F1,…FN-1} in the file
- Where sum of all frequencies = M
- The total bits required for the file is:
- Sum over i from 0 to N-1 of (|Ci| * Fi)
- Overall total bits depends on differences in frequencies
- The more extreme the differences, the better the compression
- If frequencies are all the same, no compression
- See example from board
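- A hypothetical example (made-up frequencies, not the board example): N = 4 characters with frequencies F = {50, 25, 15, 10}, so M = 100, and Huffman codeword lengths |C| = {1, 2, 3, 3}
- Total bits = 1*50 + 2*25 + 3*15 + 3*10 = 175
- A fixed-length block code would need log2(4) = 2 bits per character, or 200 bits total, so the skewed frequencies save 25 bits here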
Huffman Compression - How to decode?
- This is fairly straightforward, given that we have the Huffman tree available
- start at the root of the tree and the first bit of the file
- while not at end of file
    - if current bit is a 0, go left in tree
    - else go right in tree // bit is a 1
    - if we are at a leaf
        - output character
        - go to root
    - read next bit of file
- Each character is a path from the root to a leaf
- If we are not at the root when end of file is reached, there was an error in the file
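- The decoding loop above could look like the following Python sketch, assuming a simple Node class for the Huffman tree (the class and names are illustrative, not from the slides):

```python
class Node:
    """Huffman tree node: leaves carry a character, internal nodes do not."""
    def __init__(self, char=None, left=None, right=None):
        self.char = char
        self.left = left
        self.right = right

    def is_leaf(self):
        return self.left is None and self.right is None


def huffman_decode(bits, root):
    """Decode a string of '0'/'1' bits by walking the tree from the root."""
    out = []
    node = root
    for bit in bits:
        node = node.left if bit == '0' else node.right  # 0 = go left, 1 = go right
        if node.is_leaf():
            out.append(node.char)  # a full root-to-leaf path = one character
            node = root            # restart at the root for the next codeword
    if node is not root:
        raise ValueError("bit stream ended mid-codeword: error in the file")
    return ''.join(out)


# Example: codes A=0, B=10, C=11
tree = Node(left=Node('A'), right=Node(left=Node('B'), right=Node('C')))
print(huffman_decode("010110", tree))  # -> ABCA
```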
Huffman Compression - How to encode?
- This is trickier, since we are starting with characters and outputting codewords
- Using the tree directly, we would have to start at a leaf (after first finding the correct leaf), move up to the root, and finally reverse the resulting bit pattern
- Instead, let's process the tree once (using a traversal) to build an encoding TABLE.
- Demonstrate inorder traversal on board
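- For example, the table could be built with one depth-first traversal that records the path to each leaf; a sketch reusing the hypothetical Node class and example tree from the decoding sketch above:

```python
def build_code_table(root):
    """One traversal of the Huffman tree: the root-to-leaf path is each character's codeword."""
    table = {}

    def visit(node, path):
        if node.is_leaf():
            table[node.char] = path        # codeword = path taken from the root
            return
        visit(node.left, path + '0')       # left edge contributes a 0
        visit(node.right, path + '1')      # right edge contributes a 1

    visit(root, '')
    return table


def huffman_encode(text, table):
    """Encoding then becomes a simple table lookup per character."""
    return ''.join(table[ch] for ch in text)


table = build_code_table(tree)         # {'A': '0', 'B': '10', 'C': '11'} for the tree above
print(huffman_encode("ABCA", table))   # -> 010110
```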
Huffman Compression - How to determine weights/frequencies?
- 2-pass algorithm
- Process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing (sketched in code after this list)
- Ensures that each Huffman tree will be optimal for each file
- However, to decode, the tree/freq information must be stored in the file
- Likely at the front of the file, so the decompressor first reads the tree info, then uses it to decompress the rest of the file
- Adds extra space to file, reducing overall compression quality
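- A rough Python sketch of the two-pass structure, reusing the Node, build_code_table, and huffman_encode sketches above (the file name and the frequency header format here are purely illustrative):

```python
import heapq
from collections import Counter

def build_huffman_tree(freqs):
    """Standard greedy construction: repeatedly merge the two lowest-frequency subtrees."""
    heap = [(f, i, Node(ch)) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)                      # tie-breaker so tuples never compare Nodes
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next_id, Node(left=t1, right=t2)))
        next_id += 1
    return heap[0][2]

# Pass 1: read the file once just to count frequencies
with open("input.txt") as f:
    text = f.read()
freqs = Counter(text)

# Build the tree and code table from the counts, then pass 2: encode the file,
# writing the frequency information first so the decoder can rebuild the same tree
tree = build_huffman_tree(freqs)
table = build_code_table(tree)
header = repr(dict(freqs))                   # illustrative header format only
compressed = header + "\n" + huffman_encode(text, table)
```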
Huffman Compression - Overhead especially reduces quality for smaller files, since the tree/freq info may add a significant percentage to the file size
- Thus larger files have a higher potential for compression with Huffman than do smaller ones
- However, just because a file is large does NOT mean it will compress well
- The most important factor in the compression remains the relative frequencies of the characters
- Using a static Huffman tree
- Process a lot of "sample" files, and build a single tree that will be used for all files
- Saves overhead of tree information, but generally is NOT a very good approach
Huffman Compression - There are many different file types that have very different frequency characteristics
- Ex: .cpp file vs. .txt containing an English essay
- .cpp file will have many ;, {, }, (, )
- .txt file will have many a,e,i,o,u,., etc.
- A tree that works well for one file may work poorly for another (perhaps even expanding it)
- Adaptive single-pass algorithm
- Builds tree as it is encoding the file, thereby not requiring tree information to be separately stored
- Processes file only one time
- We will not look at the details of this algorithm, but the LZW algorithm and the self-organizing search algorithm we will discuss next are also adaptive
Huffman Shortcomings - What is Huffman missing?
- Although OPTIMAL for single character (word) compression, Huffman does not take into account patterns / repeated sequences in a file
- Ex: A file with 1000 As followed by 1000 Bs, etc. for every ASCII character will not compress AT ALL with Huffman
- Yet it seems like this file should be compressible
- We can use run-length encoding in this case (see text)
- However, run-length encoding is very specific and not generally effective for most files (since they do not typically have long runs of each character)
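- A minimal run-length encoding sketch (the (character, run length) output format is illustrative) shows why the example above collapses dramatically under RLE even though its uniform frequencies defeat Huffman:

```python
from itertools import groupby

def run_length_encode(text):
    """Collapse each maximal run of a repeated character into a (char, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

# 1000 A's followed by 1000 B's becomes just two pairs
print(run_length_encode("A" * 1000 + "B" * 1000))  # -> [('A', 1000), ('B', 1000)]
```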
Compression through self-organizing search - Consider two searching heuristics:
- Move-to-Front (MTF)
- Search a list for the desired key
- When found, remove the key from its current position and place it in the front of the list
- If the list is an array, shift other keys down
- Transpose
- Search a list for the desired key
- When found, swap positions with the key before it in the list
- In other words, "transpose" it with the key that was before it
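- A sketch of both heuristics on a Python list (function names are illustrative):

```python
def search_mtf(items, key):
    """Move-to-Front: find the key, then move it to the front of the list."""
    i = items.index(key)            # linear search for the key
    items.insert(0, items.pop(i))   # remove it and re-insert at the front
    return i                        # position at which it was found

def search_transpose(items, key):
    """Transpose: find the key, then swap it with the key immediately before it."""
    i = items.index(key)
    if i > 0:
        items[i - 1], items[i] = items[i], items[i - 1]  # one step toward the front
    return i
```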
Compression through self-organizing search - Idea for both heuristics is that frequently accessed keys will end up near the front of the list
- These improve search times for popular items
- How can these be used in compression?
- Consider a list of uncompressed codewords
- For example, single bytes 0-255
- In original form, each requires 8 bits to represent (as we previously discussed)
- Like Huffman, we'd like to represent more frequently used characters with fewer bits, and less frequently used characters with more bits
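- Continuing that idea, a sketch of a move-to-front transform over byte values: frequently used bytes stay near the front of the list, so they map to small indices, which a later stage could represent with fewer bits (that variable-length stage is not shown here):

```python
def mtf_transform(data: bytes):
    """Replace each byte with its index in the current list, then move that byte to the front."""
    symbols = list(range(256))           # the uncompressed codewords 0-255
    out = []
    for b in data:
        i = symbols.index(b)             # where is this byte in the current ordering?
        out.append(i)                    # recently/frequently seen bytes give small indices
        symbols.insert(0, symbols.pop(i))
    return out

print(mtf_transform(b"aaabbbaaa"))       # -> [97, 0, 0, 98, 0, 0, 1, 0, 0]
```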