These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else




Huffman Compression

    • See example on board
  • Huffman Issues:
    • Is the code correct?
      • Does it satisfy the prefix property?
    • Does it give good compression?
    • How to decode?
    • How to encode?
    • How to determine weights/frequencies?

Huffman Compression

  • Is the code correct?
    • Based on the way the tree is formed, it is clear that the codewords are valid
    • Prefix Property is assured, since each codeword ends at a leaf
      • all original nodes corresponding to the characters end up as leaves
  • Does it give good compression?
    • For a block code of N different characters, ⌈log2N⌉ bits are needed per character

Huffman Compression

    • Given Huffman codes {C0,C1,…CN-1} for the N characters in the alphabet, each of length |Ci|
    • Given frequencies {F0,F1,…FN-1} in the file
      • Where sum of all frequencies = M
    • The total bits required for the file is:
      • Sum from 0 to N-1 of (|Ci| * Fi)
      • Overall total bits depends on differences in frequencies
        • The more extreme the differences, the better the compression
        • If frequencies are all the same, no compression
    • See example from board
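As a concrete (hypothetical) instance of the formula above, here is a Python sketch with made-up codeword lengths and frequencies; these are not the board example:

```python
# Hypothetical 4-character alphabet; these lengths are exactly what Huffman
# produces for these frequencies (merge 10+15, then 25+25, then 50+50).
lengths = {'a': 1, 'b': 2, 'c': 3, 'd': 3}      # |Ci|
freqs   = {'a': 50, 'b': 25, 'c': 15, 'd': 10}  # Fi, summing to M = 100

# Total bits for the file = sum over i of |Ci| * Fi
total_bits = sum(lengths[ch] * freqs[ch] for ch in lengths)

# A fixed-length block code needs ceil(log2(4)) = 2 bits per character.
block_bits = 2 * sum(freqs.values())

print(total_bits, block_bits)  # 175 vs 200: skewed frequencies compress
```

With uniform frequencies (25 each), Huffman assigns every codeword length 2 and the two totals match, illustrating the "no compression" case.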

Huffman Compression

  • How to decode?
    • This is fairly straightforward, given that we have the Huffman tree available
        • start at root of tree with the first bit of the file
        • while not at end of file
            • if current bit is a 0, go left in tree
            • else go right in tree // bit is a 1
            • if we are at a leaf
                • output character
                • go back to root
            • read next bit of file
        • Each character is a path from the root to a leaf
        • If we are not at the root when end of file is reached, there was an error in the file
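The decoding loop above can be sketched in Python; the tree and codewords here are hand-built and hypothetical, not derived from real frequencies:

```python
class Node:
    """A leaf holds a character; an internal node holds left/right children."""
    def __init__(self, ch=None, left=None, right=None):
        self.ch, self.left, self.right = ch, left, right

def huffman_decode(bits, root):
    out = []
    node = root
    for b in bits:                       # read next bit of file
        node = node.left if b == '0' else node.right
        if node.ch is not None:          # at a leaf: output character
            out.append(node.ch)
            node = root                  # go back to the root
    if node is not root:                 # ended mid-codeword: error in file
        raise ValueError("error in file")
    return ''.join(out)

# Hypothetical code: 'a' = 0, 'b' = 10, 'c' = 11
root = Node(left=Node('a'), right=Node(left=Node('b'), right=Node('c')))
print(huffman_decode('0100011', root))   # abaac
```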

Huffman Compression

  • How to encode?
    • This is trickier, since we are starting with characters and outputting codewords
      • Using the tree, we would have to start at a leaf (first finding the correct leaf), then move up to the root, and finally reverse the resulting bit pattern
      • Instead, let's process the tree once (using a traversal) to build an encoding TABLE.
      • Demonstrate inorder traversal on board
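The table-building traversal might look like this in Python (trees here are nested tuples, a simplification of an actual node class):

```python
# A tree is either a character (leaf) or a (left, right) pair (internal node).
def build_table(node, prefix='', table=None):
    """Depth-first traversal; the root-to-leaf path spells each codeword."""
    if table is None:
        table = {}
    if isinstance(node, str):            # leaf: record the accumulated path
        table[node] = prefix
    else:
        left, right = node
        build_table(left, prefix + '0', table)
        build_table(right, prefix + '1', table)
    return table

tree = ('a', ('b', 'c'))                 # hypothetical: 'a' = 0, 'b' = 10, 'c' = 11
table = build_table(tree)
encoded = ''.join(table[ch] for ch in 'abaac')
print(table, encoded)                    # {'a': '0', 'b': '10', 'c': '11'} 0100011
```

Encoding is then a single table lookup per character, avoiding the leaf-to-root walk and the bit reversal entirely.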

Huffman Compression

  • How to determine weights/frequencies?
    • 2-pass algorithm
      • Process the original file once to count the frequencies, then build the tree/code and process the file again, this time compressing
      • Ensures that each Huffman tree will be optimal for each file
      • However, to decode, the tree/freq information must be stored in the file
        • Likely in the front of the file, so decompress first reads tree info, then uses that to decompress the rest of the file
        • Adds extra space to file, reducing overall compression quality
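The counting pass and the tree construction can be sketched with a min-heap in Python; the tuple-based node representation and the tie-breaking counter are illustrative choices, not the only ones:

```python
import heapq
from collections import Counter

def build_huffman_tree(text):
    """Pass 1: count frequencies, then repeatedly merge the two lightest trees."""
    freqs = Counter(text)
    # Heap entries: (weight, tiebreak, tree); unique tiebreaks keep tuples comparable.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        tiebreak += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
    return heap[0][2]        # a tree is a char (leaf) or a (left, right) pair

tree = build_huffman_tree('aaaabbc')   # merges c+b first, then (c,b)+a
```

Pass 2 would build the encoding table from this tree and rewrite the file; the tree (or the frequency counts) must also be written to the output so the decoder can rebuild it.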

Huffman Compression

        • Overhead especially reduces quality for smaller files, since the tree/freq info may add a significant percentage to the file size
        • Thus larger files have a higher potential for compression with Huffman than do smaller ones
          • However, just because a file is large does NOT mean it will compress well
          • The most important factor in the compression remains the relative frequencies of the characters
    • Using a static Huffman tree
      • Process a lot of "sample" files, and build a single tree that will be used for all files
      • Saves overhead of tree information, but generally is NOT a very good approach

Huffman Compression

        • There are many different file types that have very different frequency characteristics
          • Ex: .cpp file vs. .txt containing an English essay
          • .cpp file will have many ;, {, }, (, )
          • .txt file will have many a,e,i,o,u,., etc.
          • A tree that works well for one file may work poorly for another (perhaps even expanding it)
    • Adaptive single-pass algorithm
      • Builds tree as it is encoding the file, thereby not requiring tree information to be separately stored
      • Processes file only one time
      • We will not look at the details of this algorithm, but the LZW algorithm and the self-organizing search algorithm we will discuss next are also adaptive

Huffman Shortcomings

  • What is Huffman missing?
    • Although OPTIMAL for single character (word) compression, Huffman does not take into account patterns / repeated sequences in a file
    • Ex: A file with 1000 As followed by 1000 Bs, etc. for every ASCII character will not compress AT ALL with Huffman
      • Yet it seems like this file should be compressible
      • We can use run-length encoding in this case (see text)
        • However run-length encoding is very specific and not generally effective for most files (since they do not typically have long runs of each character)
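A minimal run-length encoder for illustration (emitting pairs rather than a packed bit-level format):

```python
def run_length_encode(s):
    """Collapse each run of a repeated character into a (char, count) pair."""
    runs = []
    i = 0
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:
            j += 1                       # extend the current run
        runs.append((s[i], j - i))
        i = j
    return runs

# The pathological file above shrinks to one pair per character:
print(run_length_encode('a' * 1000 + 'b' * 1000))  # [('a', 1000), ('b', 1000)]
```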

Compression through self-organizing search

  • Consider two searching heuristics:
    • Move-to-Front (MTF)
      • Search a list for the desired key
      • When found, remove the key from its current position and place it in the front of the list
        • If the list is an array, shift other keys down
    • Transpose
      • Search a list for the desired key
      • When found, swap positions with the key before it in the list
        • In other words, "transpose" it with the key that was before it
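Both heuristics are only a few lines in Python (function names here are my own):

```python
def mtf_search(lst, key):
    """Move-to-Front: find key, remove it, reinsert it at the front."""
    i = lst.index(key)
    lst.insert(0, lst.pop(i))            # shifts the intervening keys down
    return i                             # position where the key was found

def transpose_search(lst, key):
    """Transpose: swap the found key with the key immediately before it."""
    i = lst.index(key)
    if i > 0:
        lst[i - 1], lst[i] = lst[i], lst[i - 1]
    return i

keys = ['a', 'b', 'c', 'd']
mtf_search(keys, 'c')                    # keys is now ['c', 'a', 'b', 'd']
```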

Compression through self-organizing search

    • Idea for both heuristics is that frequently accessed keys will end up near the front of the list
    • These improve search times for popular items
  • How can these be used in compression?
    • Consider a list of uncompressed codewords
      • For example, single bytes 0-255
    • In original form, each requires 8 bits to represent (as we previously discussed)
    • Like Huffman, we'd like to represent more frequently used characters with fewer bits, and less frequently used characters with more bits
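One way this idea plays out (sketched here with hypothetical names, ahead of the details): replace each byte by its current index in an MTF list, so frequently used bytes yield small indices, which a variable-length code could then represent in few bits:

```python
def mtf_transform(data, alphabet):
    """Emit each symbol's current index, then move that symbol to the front."""
    table = list(alphabet)
    out = []
    for ch in data:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))    # popular symbols stay near index 0
    return out

# 'a' and 'b' dominate, so most emitted indices are 0:
print(mtf_transform('aabbb', 'abc'))     # [0, 0, 1, 0, 0]
```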
