These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else


Compression through self-organizing search





    • Note that for each number of bits from 2 on, ½ of the possible codewords are not used
      • Ex: 00110, 00111 are used but 00100 and 00101 are not
      • With slightly more complicated preprocessing, we can use all of these codewords
        • But we must be careful so that all possible positions can be represented and unambiguously decoded
      • This will improve our overall compression quality since it will pack more possible values into shorter codewords
    • Compression quality will also improve if we consider more initial possibilities
      • Ex: Instead of 8 bits use 16 bits for uncompressed data
        • Now we have 2^16 possible position values

Compression through self-organizing search

  • Decompression
    • We know in advance the uncompressed key size, k (ex: 8 bits or 16 bits)
    • Array is initialized just as in compression phase
      • Since we are only looking up positions, we only need one array (Array2), indexed on position
    • while not eof
      • Read in lg(k) bits as an integer, N (ex: 3 bits when k = 8)
      • Read in N+1 bits as an integer, pos (the position value)
      • Using Array2, find the key at location pos and output it
      • Update Array2 using MTF or Transpose
    • See example on board; a sketch of the loop is below
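  A minimal sketch of this loop in C, assuming 8-bit keys (k = 8), MTF updating, and a small bit-reading helper over an in-memory buffer (the real version would read bits from the compressed file; the identifiers here are illustrative):

    /* Sketch: self-organizing-search decompression with MTF, k = 8. */
    #include <stdio.h>
    #include <string.h>

    #define K    8              /* uncompressed key size in bits */
    #define NPOS (1 << K)       /* number of possible key values */

    static const unsigned char *buf;  /* compressed data         */
    static long bitpos, bitlen;       /* cursor and limit, bits  */

    /* Read n bits, most significant first, as an unsigned int. */
    static unsigned read_bits(int n) {
        unsigned v = 0;
        while (n-- > 0) {
            v = (v << 1) | ((buf[bitpos >> 3] >> (7 - (bitpos & 7))) & 1);
            bitpos++;
        }
        return v;
    }

    int main(void) {
        unsigned char key[NPOS];               /* Array2: position -> key */
        for (int i = 0; i < NPOS; i++)
            key[i] = (unsigned char)i;

        static const unsigned char data[] = { 0 };  /* compressed bits go here */
        buf = data;
        bitlen = 0;                            /* set from the real input size */

        while (bitpos + 3 <= bitlen) {         /* 3 = lg(k) bits when k = 8 */
            unsigned N   = read_bits(3);
            unsigned pos = read_bits(N + 1);   /* the position value */
            unsigned char c = key[pos];
            putchar(c);                        /* output the key at pos    */
            memmove(&key[1], &key[0], pos);    /* MTF: slide 0..pos-1 down */
            key[0] = c;                        /* ...and move c to front   */
        }
        return 0;
    }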

Comparison to other compression algorithms

    • It is natural to compare the self-organizing search compression algorithm to Huffman
      • Both compress fixed size original codewords into variable length codewords
        • For a file with extreme frequency characteristics, Huffman may do better
        • However, self-organizing search handles changes in frequencies better
          • Recall example: 1000 As, 1000 Bs, 1000 Cs, …
          • To Huffman, all frequencies will be the same
          • Using MTF, after the first occurrence of each letter the remaining occurrences compress very well, since each is found at the front of the list during its run
    • But compression is still 1-1 original codeword to compressed codeword

Comparison to other compression algorithms

    • Clearly, the size of the original codewords considered is important
      • With 8 bits at a time, the best compression we can do is about ½ (4/8), since we need 3 bits for the length prefix and then at least 1 bit for the position (see the general formula below)
      • With 16 bit original codewords, now we can achieve a "best ratio" of 5/16 since we need 4 bits then at least 1 bit
      • With 32 bit original codewords, we can do even better – 6/32
        • However, note now that we will need 2^32 positions in our arrays – this is likely too large for us to use effectively (even if we had the memory, the run-times would be prohibitive)
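  In general, for k-bit original codewords the length prefix takes lg(k) bits and the smallest position takes 1 bit, so the best possible ratio is

    $$ \text{best ratio} = \frac{\lg k + 1}{k} $$

  which gives 4/8, 5/16, and 6/32 for k = 8, 16, and 32, matching the figures above.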

LZW Compression

  • Idea of LZW:
    • Instead of using variable length codewords to encode single characters, use block codewords to encode groups of characters
      • The more characters that can be represented by a single codeword, the better the compression
        • Ex: Consider the word "the" – 24 bits using ASCII
          • If we can encode the entire word in one 12 bit codeword, we have cut its size in half
      • Groups are built up gradually as patterns within the file are repeated

LZW Compression

  • LZW Compression Algorithm:
    • See handout
    • See lzw.txt and notes from board; a sketch of the compression loop is below
  • LZW Decompression Algorithm
    • See handout
    • See lzw2.txt and notes from board
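  Since the handout itself is not reproduced here, a minimal sketch of the compression loop, using a naive linear-scan dictionary of explicit strings and printing codewords as plain integers (the handout's version reads a file and packs 12-bit codes; all identifiers are illustrative):

    #include <stdio.h>
    #include <string.h>

    #define MAXCODES 4096
    #define MAXLEN   64     /* longest pattern this sketch can hold */

    static char dict[MAXCODES][MAXLEN];
    static int  ncodes;

    /* Return the code for s, or -1 if s is not yet in the dictionary. */
    static int lookup(const char *s) {
        for (int i = 0; i < ncodes; i++)
            if (strcmp(dict[i], s) == 0) return i;
        return -1;
    }

    int main(void) {
        /* codes 0..255 are the single characters */
        for (ncodes = 0; ncodes < 256; ncodes++) {
            dict[ncodes][0] = (char)ncodes;
            dict[ncodes][1] = '\0';
        }

        const char *in = "TOBEORNOTTOBEORTOBEORNOT";
        char w[MAXLEN] = "";                /* current match */

        for (const char *p = in; *p; p++) {
            char wc[MAXLEN];
            snprintf(wc, sizeof wc, "%s%c", w, *p);
            if (lookup(wc) >= 0) {          /* w+c already known: extend */
                strcpy(w, wc);
            } else {                        /* emit code(w), add w+c     */
                printf("%d ", lookup(w));
                if (ncodes < MAXCODES) strcpy(dict[ncodes++], wc);
                w[0] = *p; w[1] = '\0';
            }
        }
        if (w[0]) printf("%d\n", lookup(w)); /* flush the last match */
        return 0;
    }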

LZW Compression

  • Why / how does it work?
    • The compression and decompression algorithms are both building the EXACT SAME (codeword, string) dictionary as they proceed
      • Compression stores them as (string, codeword)
        • During compression, strings are looked up and codewords are returned
      • Decompression stores them as (codeword, string)
        • During decompression, codewords are looked up and strings are returned
      • As long as both follow the same steps, the compressed file does not need to store any extra information about the code

LZW Compression

    • This is an adaptive algorithm
      • The compression adapts to the patterns as they are seen, and the decompression does the same
    • However, as we discussed, the decompression algorithm is one step "behind" the compression algorithm in building the dictionary
      • In most situations this is not a problem
      • However, if, during compression, the (pattern, codeword) that was just added to the dictionary is immediately used in the next step, the decompression algorithm will not yet know the codeword
      • Luckily this special case can be recognized and handled relatively easily
        • See lzw3.txt; a decompression sketch handling this case is below
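  Using the same toy string dictionary as the compression sketch above, here is the matching decompression loop with that special case handled: when a code is not yet in the table, it must be the pattern the encoder just added, namely w plus the first character of w. The hard-wired codes are what the sketch above would emit for "abababa" (identifiers are illustrative, not from lzw2.txt/lzw3.txt):

    #include <stdio.h>
    #include <string.h>

    #define MAXCODES 4096
    #define MAXLEN   64

    static char dict[MAXCODES][MAXLEN];
    static int  ncodes;

    int main(void) {
        for (ncodes = 0; ncodes < 256; ncodes++) {
            dict[ncodes][0] = (char)ncodes;
            dict[ncodes][1] = '\0';
        }

        /* what the compression sketch emits for "abababa" */
        int codes[] = { 'a', 'b', 256, 258 };
        int n = sizeof codes / sizeof codes[0];

        char w[MAXLEN] = "";                       /* previous string */
        for (int i = 0; i < n; i++) {
            char entry[MAXLEN];
            if (codes[i] < ncodes) {
                strcpy(entry, dict[codes[i]]);     /* normal case */
            } else {
                /* special case: the encoder used the code it added on
                 * the step we have not caught up to yet, so the string
                 * must be w plus the first character of w */
                snprintf(entry, sizeof entry, "%s%c", w, w[0]);
            }
            fputs(entry, stdout);
            if (w[0] != '\0' && ncodes < MAXCODES) /* add w + entry[0] */
                snprintf(dict[ncodes++], MAXLEN, "%s%c", w, entry[0]);
            strcpy(w, entry);
        }
        putchar('\n');                             /* prints "abababa" */
        return 0;
    }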

LZW Implementation Issues

  • How to represent / manage the dictionary
  • How many bits to use for the codewords
  • What to do when / if we run out of codewords
  • How to do I/O with fractions of bytes
    • This issue applies to Huffman as well, so discussion here will be applicable there

LZW Implementation Issues

  • How to represent / manage the dictionary
    • What operations do we need?
      • Insert and Lookup
    • For a file with M characters, we will need to do M Lookups
      • Number of Inserts depends on how long our patterns get
      • Thus we want these to be VERY FAST
        • Sorted Array takes much too long for Inserts
        • BST would be ok, but even lgM per Lookup is probably more time than we want to spend
          • Would yield Theta(MlgM) time in total
        • Do we have any better options?

LZW Implementation Issues

      • Two most likely candidates for ENCODING dictionary:
        • Trie or Hash Table
        • Both allow lookups in time proportional to string length, independent of the number of strings
        • Trie insert is a bit tricky, but if we use a DLB, it is not too bad
      • Consider implementation done in lzw.c
        • Hash table used, but in a clever way
        • Instead of storing the entire string each time (wasting a lot of space), they store the codeword for the "prefix" and the last character
          • This works because strings are built one character at a time, always adding one character onto a string already in the dictionary
        • See code and board; a sketch of this entry layout is below
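  A sketch of that space-saving layout with a linear-probing hash table: each dictionary string is stored as the codeword of its prefix plus the one new character, never as a full string. The identifiers and hash function here are illustrative, not the actual ones in lzw.c:

    #include <stdio.h>

    #define TABSIZE 13093          /* prime, comfortably above 4096 codes */
    #define EMPTY   (-1)

    typedef struct {
        int  prefix;               /* codeword of everything but the last char */
        char last;                 /* the last character                       */
        int  code;                 /* codeword assigned to this whole string   */
    } entry;

    static entry tab[TABSIZE];

    /* Find the slot for (prefix, c) by linear probing; the slot is either
     * the match (lookup hit) or an EMPTY slot (the insertion point).      */
    static unsigned probe(int prefix, char c) {
        unsigned h = ((unsigned)prefix * 31u + (unsigned char)c) % TABSIZE;
        while (tab[h].code != EMPTY &&
               (tab[h].prefix != prefix || tab[h].last != c))
            h = (h + 1) % TABSIZE;
        return h;
    }

    int main(void) {
        for (int i = 0; i < TABSIZE; i++)
            tab[i].code = EMPTY;

        /* insert "the string for code 256, followed by 'x'" as code 257 */
        unsigned h = probe(256, 'x');
        tab[h].prefix = 256; tab[h].last = 'x'; tab[h].code = 257;

        printf("lookup (256,'x') -> code %d\n", tab[probe(256, 'x')].code);
        return 0;
    }

  Lookups and inserts both cost one probe sequence, so each step of the compression loop is constant time on average, independent of how many strings are in the dictionary.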
