These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else




LZW Implementation Issues

      • For DECODING, the idea is even easier
        • Now the codewords are the keys, and the strings are what we return
        • We don't need to hash anything
        • We can simply use the codeword values to index an array of strings
          • In lzw.c the table does not actually store the string itself, but rather a (prefix codeword, last character) pair, just as for encoding
          • See code – a rough sketch of such a table appears below
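
  A rough sketch of what such a decode table might look like in C
  (identifiers here are illustrative, not the actual names in lzw.c):

      /* Each codeword maps to a (prefix codeword, last character) pair,
         mirroring what the encoder stores. */
      #define MAX_CODES 4096            /* e.g. 12-bit codewords */

      struct entry {
          int  prefix;                  /* codeword of the preceding string */
          char last;                    /* final character of this string   */
      };

      static struct entry table[MAX_CODES];

      /* Rebuild the string for a codeword by walking the prefix chain,
         filling buf from the back; returns the string's length. */
      static int expand(int code, char *buf, int bufsize) {
          int len = 0;
          while (code >= 256 && len < bufsize) {   /* codes < 256 are single chars */
              buf[bufsize - ++len] = table[code].last;
              code = table[code].prefix;
          }
          if (len < bufsize)
              buf[bufsize - ++len] = (char)code;
          return len;                   /* string starts at buf[bufsize - len] */
      }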

LZW Implementation Issues

  • How many bits to use for the codewords
    • Fewer bits:
      • Smaller codewords, giving compression EARLIER in the process
      • Fewer available codewords, limiting the compression LATER in the process
    • More bits:
      • Larger codewords, delaying actual compression until longer patterns are found
      • More available codewords, allowing for greater compression LATER in the process
    • Ex: Consider 10 bits (2^10 = 1,024 codewords of 10 bits each) vs. 16 bits (2^16 = 65,536 codewords of 16 bits each)

LZW Implementation Issues

    • Can we get the "best of both worlds"?
      • We'd like to use fewer bits earlier in the process, to get compression sooner
      • We'd like to use more bits later in the process, to get greater compression later in the file
      • In fact this is exactly what the Unix compress algorithm does
        • It starts out using 9 bits for codewords, adding an extra bit when all codewords for previous size are used. Ex:
          • 9 bits for codewords 0-511
          • 10 bits for codewords 512-1023
          • 11 bits for codewords 1024-2047
          • etc.
        • The decompressor grows its codeword size at the same points, so encoder and decoder stay in sync – that's why it works (see the sketch below)
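
  A minimal sketch of how this growing codeword width could be tracked
  (illustrative only – this is not the actual compress source):

      int code_bits = 9;               /* start with 9-bit codewords          */
      int next_code = 256;             /* codes 0-255 are single characters   */
      int max_code  = 1 << 9;          /* 512: first code that needs 10 bits  */

      /* Call each time a new dictionary entry is assigned a codeword. */
      void codeword_assigned(void) {
          next_code++;
          if (next_code >= max_code && code_bits < 16) {
              code_bits++;             /* 9 -> 10 -> 11 -> ...                */
              max_code <<= 1;
          }
      }

  The decompressor applies the same rule, so both sides always agree on how
  many bits the next codeword occupies.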

LZW Implementation Issues

  • What to do when / if we run out of codewords
    • If we use a block code of a specific size, we have a finite number of codewords that we can represent
      • Even the "compress" version would eventually stop adding bits due to I/O issues (we will discuss next)
    • When all codewords have been used, what do we do?

LZW Implementation Issues

      • Two primary options, each with advantages and disadvantages (both sketched in code below):
        • Keep compressing as before, but simply stop adding new entries to the dictionary
          • Adv: Maintains long patterns that have already been built up
          • Disadv: If the file's content changes later (introducing new patterns), those new patterns will not be compressed effectively
        • Throw out entire dictionary, then start again with the single characters
          • Adv: Allows new patterns to be compressed
          • Disadv: Until new patterns are built and added to the dictionary, compression is minimal
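
  A small sketch of the two policies (names are illustrative, not the
  identifiers used in lzw.c or compress):

      #define MAX_CODES  4096
      #define FIRST_CODE  256          /* codes 0-255 are the single characters */

      static int next_code = FIRST_CODE;
      static int freeze_when_full = 1; /* 1 = option 1 (freeze), 0 = option 2 (reset) */

      /* Called whenever the encoder wants to add a new string to the dictionary. */
      static void maybe_add_entry(void) {
          if (next_code < MAX_CODES) {
              /* Room left: the new (prefix, last char) pair would be stored at
                 index next_code (the table itself is omitted from this sketch). */
              next_code++;
          } else if (!freeze_when_full) {
              /* Option 2: throw the dictionary away and start over from the
                 single-character entries; new patterns can then be learned. */
              next_code = FIRST_CODE;
          }
          /* Option 1 (freeze_when_full): do nothing here -- keep using the
             codewords already built, but never add new ones. */
      }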

LZW Implementation Issues

  • How to do I/O with fractions of bytes
    • Unless we pick an exact multiple of a byte for our codeword size (8, 16, 24, 32 bits) we will need to input and output fractions of bytes
    • We will not actually input / output fractions of bytes
      • Rather we will keep a buffer and read / write exact numbers of bytes, processing the necessary bits from the buffer
      • This requires a few bit-level operations
        • Shifting, bitwise OR, masking, etc.
    • See lzw.c – go to recitation! (a sketch of buffered bit output appears below)
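
  A minimal sketch of buffered bit output (the real code in lzw.c differs;
  this just shows the shifting / OR-ing / masking involved):

      #include <stdio.h>

      static unsigned long bit_buffer = 0;   /* pending bits, newest at the bottom */
      static int           bit_count  = 0;   /* how many pending bits are buffered */

      /* Append the low nbits bits of code, writing whole bytes as they form. */
      void put_bits(FILE *out, unsigned code, int nbits) {
          bit_buffer = (bit_buffer << nbits) | (code & ((1UL << nbits) - 1));
          bit_count += nbits;
          while (bit_count >= 8) {
              bit_count -= 8;
              fputc((int)((bit_buffer >> bit_count) & 0xFF), out);
          }
      }

      /* At end of output, pad the leftover bits out to a full byte. */
      void flush_bits(FILE *out) {
          if (bit_count > 0)
              fputc((int)((bit_buffer << (8 - bit_count)) & 0xFF), out);
          bit_count = 0;
      }

  Reading codewords back in works the same way in reverse: read whole bytes
  into a buffer and peel off nbits bits at a time.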

LZW vs. Huffman

  • In practice, which is better, LZW or Huffman?
    • For most files, LZW gives better compression ratios
    • It is also generally better for compressing archived directories of files
    • Why?
      • Files can build up very long patterns, allowing LZW to get a great deal of compression
      • Different file types do not "hurt" each other with LZW as they do with Huffman – with each type we simply have to build up new patterns

LZW vs. Huffman

  • Let's compare
    • See compare.txt handout
    • Note that for a single text file, Huffman does pretty well
    • For a large archived file, Huffman does not do as well
    • gzip outperforms Huffman and LZW
      • Combo of LZ77 and Huffman
      • See: http://en.wikipedia.org/wiki/Gzip
        • http://www.gzip.org/
        • http://www.gzip.org/algorithm.txt

Limitations on Compression and Entropy

  • How much can we compress a file (in a lossless way)?
    • Ex: Take a large file, and recursively compress it K times
      • If K is large enough, maybe I can compress the entire file down to 1 bit!
    • Of course this won't work, but why not?
      • Clearly, we cannot unambiguously decompress this file – that one bit would have to stand for any of an enormous number of possible "original" files (a simple counting / pigeonhole argument)

Limitations on Compression and Entropy

    • Generally speaking, the amount we can compress a file is dependent upon the amount of entropy in the data
      • Informally, entropy is the amount of uncertainty / randomness in the data
        • Information entropy is very similar in nature to thermodynamic entropy, which you may have discussed in a physics or chemistry class
      • The more entropy, the less we can compress; the less entropy, the more we can compress (Shannon's formula, shown below, makes this precise)
        • Ex: File containing all A's can be heavily compressed since all of the characters are the same
        • Ex: File containing random bits cannot be compressed at all
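
  For reference, Shannon's entropy for a source whose symbols appear with
  probabilities p_i is

      H = - sum over i of ( p_i * log2(p_i) )   bits per symbol

  Ex: a file of all A's has p_A = 1, so H = -1 * log2(1) = 0 bits per
  character (maximally compressible).
  Ex: a file of independent random bits has p_0 = p_1 = 1/2, so
  H = -(1/2)log2(1/2) - (1/2)log2(1/2) = 1 bit per bit (no compression
  possible).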

Limitations on Compression and Entropy

    • When we compress a file (ex: using compress or gzip) we are replacing patterns / repeated sequences with codewords, leaving output with much more entropy per bit
    • Attempting to compress the result (even with a different algorithm) may yield little or no further compression
      • However, in some cases it may be worth trying, if the two algorithms are very different
    • For more info, see:
      • http://en.wikipedia.org/wiki/Lossless_data_compression
      • http://en.wikipedia.org/wiki/Information_entropy

Some Mathematical Algorithms

  • Integer Multiplication
    • With predefined int variables we think of multiplication as being Theta(1)
      • Is the multiplication really a constant time op?
        • No -- it is constant due to the constant size of the numbers (32 bits), not due to the algorithm
    • What if we need to multiply very large ints?
      • Ex: RSA, which we will see shortly, needs ints of sizes up to 2048 bits
    • Now we need to think of good integer multiplication algorithms (a sketch of the classic grade-school method is below)
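
  A sketch of the grade-school approach, with each operand stored as an array
  of n base-10 digits (least significant digit first). It takes Theta(n^2)
  digit operations -- clearly not constant time once n grows. Illustrative
  only; real big-integer code works in a much larger base:

      /* result must have room for 2*n digits and is overwritten. */
      void multiply(const int *a, const int *b, int n, int *result) {
          for (int i = 0; i < 2 * n; i++)
              result[i] = 0;
          for (int i = 0; i < n; i++) {
              int carry = 0;
              for (int j = 0; j < n; j++) {
                  int t = result[i + j] + a[i] * b[j] + carry;
                  result[i + j] = t % 10;
                  carry = t / 10;
              }
              result[i + n] = carry;    /* this slot is still 0 at this point */
          }
      }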
