- Note that whenever the position field is 2 or more bits long, half of the possible codewords are not used, since the leading bit of the position must be 1
- Ex: 00110, 00111 are used but 00100 and 00101 are not
- With slightly more complicated preprocessing, we can use all of these codewords
- But we must be careful so that all possible positions can be represented and unambiguously decoded
- This will improve our overall compression quality since it will pack more possible values into shorter codewords
- Compression quality will also improve if we consider more initial possibilities
- Ex: Instead of 8 bits use 16 bits for uncompressed data
- Now we have 2^16 possible position values
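- With k = 8, for example, a single compression step can be sketched as below (a hedged sketch, not the handout's code; emit_bits() is a hypothetical bit-level writer)

```c
#define K     8        /* uncompressed key size in bits                 */
#define LOG2K 3        /* lg(K): bits used to store the length field N  */

void emit_bits(unsigned value, int count);   /* hypothetical bit writer */

/* Encode one position (0 .. 2^K - 1) found by self-organizing search:
 * output N = floor(lg(pos)) in LOG2K bits, then pos itself in N+1 bits.
 * Once N >= 1 the leading bit of pos is always 1, which is why half of
 * the longer codewords go unused in the basic scheme.                  */
void encode_position(unsigned pos)
{
    int n = 0;
    unsigned p = pos;
    while (p > 1) { p >>= 1; n++; }  /* n = floor(lg(pos)), 0 if pos < 2 */

    emit_bits(n, LOG2K);             /* the length field                 */
    emit_bits(pos, n + 1);           /* the position value               */
}
```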
## Compression through self-organizing search - Decompression
- We know in advance the uncompressed key size, k (ex: 8 bits or 16 bits)
- Array is initialized just as in compression phase
- Since we are only looking up positions, we only need one array (Array2), indexed on position
- while not eof
- Read in lg(k) bits as an integer, N
- Read in N+1 bits as an integer, pos (the position value)
- Using Array2, find the key at location pos and output it
- Update Array2 using MTF or Transpose
- See example on board
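- A hedged C sketch of the loop above, assuming 0-based positions and hypothetical bit-level readers read_bits() and eof_reached()

```c
#include <stdio.h>

unsigned read_bits(int count);  /* hypothetical: next 'count' input bits */
int eof_reached(void);          /* hypothetical end-of-input test        */

void decompress(unsigned char array2[], int log2k)
{
    while (!eof_reached()) {
        unsigned n   = read_bits(log2k);  /* the length field N          */
        unsigned pos = read_bits(n + 1);  /* the position value          */
        unsigned char key = array2[pos];  /* key at location pos         */
        putchar(key);                     /* output the original key     */

        /* Move-to-Front update: slide the keys ahead of pos back one,
         * then put this key at the front of the array.                  */
        for (unsigned i = pos; i > 0; i--)
            array2[i] = array2[i - 1];
        array2[0] = key;
    }
}
```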
## Comparison to other compression algorithms - It is natural to compare the self-organizing search compression algorithm to Huffman
- Both compress fixed size original codewords into variable length codewords
- For a file with extreme frequency characteristics, Huffman may do better
- However, self-organizing search handles changes in frequencies better
- Recall example: 1000 A's, then 1000 B's, then 1000 C's, …
- To Huffman, all frequencies will be the same
- Using MTF, after the first occurrence of each letter the remaining occurrences compress extremely well, since the letter stays at the front of the list
- But compression is still 1-1 original codeword to compressed codeword
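- Ex: with 8-bit keys, once a letter has moved to the front of the list, every repeat of it encodes as position 0 in just 4 bits (a 3-bit length field plus a 1-bit position) instead of 8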
## Comparison to other compression algorithms - Clearly, the size of the original codewords considered is important
- With 8 bits at a time, the best compression we can do is about ½ (4/8), since we need 3 bits for the length field, then at least 1 bit for the position
- With 16 bit original codewords, now we can achieve a "best ratio" of 5/16 since we need 4 bits then at least 1 bit
- With 32 bit original codewords, we can do even better – 6/32
- However, note that we will now need 2^32 positions in our arrays – this is likely too large for us to use effectively (even if we had the memory, the run-times would be prohibitive)
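- In general, the best case per original codeword is a lg(k)-bit length field plus a single position bit, for a best ratio of (lg(k) + 1)/k: (3+1)/8, (4+1)/16, (5+1)/32, …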
## LZW Compression - Idea of LZW:
- Instead of using variable length codewords to encode single characters, use block codewords to encode groups of characters
- The more characters that can be represented by a single codeword, the better the compression
- Ex: Consider the word "the" – 24 bits using ASCII
- If we can encode the entire word in one 12 bit codeword, we have cut its size in half
- Groups are built up gradually as patterns within the file are repeated
## LZW Compression - LZW Compression Algorithm:
- See handout
- See lzw.txt and notes from board
- LZW Decompression Algorithm
- See handout
- See lzw2.txt and notes from board
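- The handout is authoritative; purely as a reminder of the shape of the compression loop, here is a hedged C sketch, with dict_lookup(), dict_insert(), and output_code() as hypothetical helpers

```c
#include <stdio.h>

#define FIRST_FREE 256  /* codes 0..255 stand for the single characters */

int  dict_lookup(int prefix, int c);  /* hypothetical: -1 if (prefix,c) absent */
void dict_insert(int prefix, int c, int code);
void output_code(int code);           /* hypothetical codeword writer   */

/* Sketch of the LZW compression loop; assumes a non-empty input file.  */
void lzw_compress(FILE *in)
{
    int next_code = FIRST_FREE;
    int prefix = fgetc(in);           /* code of the current pattern    */
    int c;

    while ((c = fgetc(in)) != EOF) {
        int code = dict_lookup(prefix, c);
        if (code >= 0) {
            prefix = code;            /* pattern + c is known: extend it  */
        } else {
            output_code(prefix);      /* emit the longest known match     */
            dict_insert(prefix, c, next_code++);  /* learn pattern + c    */
            prefix = c;               /* restart the pattern from c       */
        }
    }
    output_code(prefix);              /* flush the final pattern          */
}
```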
## LZW Compression - Why / how does it work?
- The compression and decompression algorithms are both building the EXACT SAME (codeword, string) dictionary as they proceed
- Compression stores them as (string, codeword)
- During compression, strings are looked up and codewords are returned
- Decompression stores them as (codeword, string)
- During decompression, codewords are looked up and strings are returned
- As long as both follow the same steps, the compressed file does not need to store any extra information about the code
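- Ex: with 8-bit characters, codewords 0-255 can stand for the single characters; if "th" is the first pattern added, compression stores ("th", 256) while decompression stores (256, "th")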
## LZW Compression - This is an adaptive algorithm
- The compression adapts to the patterns as they are seen, and the decompression does the same
- However, as we discussed, the decompression algorithm is one step "behind" the compression algorithm in building the dictionary
- In most situations this is not a problem
- However, if, during compression, the (pattern, codeword) pair that was just added to the dictionary is used immediately in the next step, the decompression algorithm will not yet have that codeword
- Luckily this special case can be recognized and handled relatively easily
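- A hedged sketch of how the special case folds into the decompression loop, with read_code(), dict_get(), dict_add(), and emit() as hypothetical helpers

```c
#include <string.h>

#define FIRST_FREE 256

int  read_code(void);               /* hypothetical: next codeword, -1 at EOF    */
int  dict_get(int code, char *buf); /* copy code's string into buf; return length */
void dict_add(int code, const char *s, int len);
void emit(const char *s, int len);  /* hypothetical output routine               */

void lzw_decompress(void)
{
    char prev[4096], cur[4096];     /* longest pattern assumed to fit    */
    int next_code = FIRST_FREE;

    int plen = dict_get(read_code(), prev);  /* first code: a single char */
    emit(prev, plen);

    int code;
    while ((code = read_code()) != -1) {
        int clen;
        if (code == next_code) {
            /* Special case: the compressor used the codeword it added on
             * the previous step, which we have not built yet.  That string
             * must be prev plus prev's own first character.              */
            memcpy(cur, prev, plen);
            cur[plen] = prev[0];
            clen = plen + 1;
        } else {
            clen = dict_get(code, cur);
        }
        emit(cur, clen);

        prev[plen] = cur[0];                 /* new entry: prev + cur[0]  */
        dict_add(next_code++, prev, plen + 1);

        memcpy(prev, cur, clen);             /* advance to current string */
        plen = clen;
    }
}
```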
## LZW Implementation Issues - How to represent / manage the dictionary
- How many bits to use for the codewords
- What to do when / if we run out of codewords
- How to do I/O with fractions of bytes
- This issue applies to Huffman as well, so discussion here will be applicable there
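- A minimal sketch of buffered bit output (names are illustrative, not from lzw.c), assuming codewords of at most 16 bits; the same idea works for Huffman

```c
#include <stdio.h>

static unsigned long bit_buf = 0;   /* bits waiting to be written        */
static int bit_cnt = 0;             /* how many buffered bits are valid  */

/* Append the low 'nbits' bits of value, writing whole bytes as they fill. */
void put_bits(FILE *out, unsigned value, int nbits)
{
    bit_buf = (bit_buf << nbits) | (value & ((1UL << nbits) - 1));
    bit_cnt += nbits;
    while (bit_cnt >= 8) {
        bit_cnt -= 8;
        fputc((int)((bit_buf >> bit_cnt) & 0xFF), out);
    }
}

/* Pad the final partial byte with zero bits at the end of the file. */
void flush_bits(FILE *out)
{
    if (bit_cnt > 0)
        fputc((int)((bit_buf << (8 - bit_cnt)) & 0xFF), out);
    bit_cnt = 0;
}
```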
## LZW Implementation Issues - How to represent / manage the dictionary
- What operations do we need?
- For a file with M characters, we will need to do M Lookups
- Number of Inserts depends on how long our patterns get
- Thus we want these to be VERY FAST
- Sorted Array takes much too long for Inserts
- BST would be ok, but even lg M per Lookup is probably more time than we want to spend
- Would yield Θ(M lg M) total time
- Do we have any better options?
## LZW Implementation Issues - Two most likely candidates for ENCODING dictionary:
- Trie or Hash Table
- Both allow lookups in time proportional to string length, independent of the number of strings
- Trie insert is a bit tricky, but if we use a DLB, it is not too bad
- Consider implementation done in lzw.c
- Hash table used, but in a clever way
- Instead of storing the entire string each time (wasting a lot of space), they store the codeword for the "prefix" and the last character
- This works because strings are built one character at a time, always adding on the strings already in the dictionary
- See code and board
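- A hedged sketch of that entry layout (table size, hash function, and probing are illustrative choices, not necessarily those in lzw.c)

```c
#define TABLE_SIZE 5021   /* a prime comfortably larger than 4096 codes   */

struct entry {
    int code;             /* codeword assigned to this string, -1 = empty */
    int prefix;           /* codeword of the string minus its last char   */
    int ch;               /* the last character of the string             */
};

/* Slots must be initialized with code = -1 (empty) before use. */
static struct entry table[TABLE_SIZE];

/* Probe for (prefix, ch); returns the slot holding it, or an empty slot. */
int find_slot(int prefix, int ch)
{
    int i = ((prefix << 8) ^ ch) % TABLE_SIZE;
    while (table[i].code != -1 &&
           (table[i].prefix != prefix || table[i].ch != ch))
        i = (i + 1) % TABLE_SIZE;    /* linear probing on collision       */
    return i;
}
```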