LZW Decoding

For decoding, the idea is even simpler. Now the codewords are the keys and the strings are the values we return, so nothing needs to be hashed: we can simply use the codeword values to index an array of strings. In lzw.c the entry is not actually the string itself but rather a (prefix codeword, last character) pair, just as for encoding. See the code, and the decode-table sketch at the end of this LZW discussion.

LZW Implementation Issues

How many bits should we use for the codewords?
- Fewer bits:
  - Smaller codewords, giving compression EARLIER in the process
  - Fewer available codewords, limiting the compression LATER in the process
- More bits:
  - Larger codewords, delaying actual compression until longer patterns are found
  - More available codewords, allowing for greater compression LATER in the process
- Ex: Consider 10 bits vs. 16 bits

LZW Implementation Issues

Can we get the "best of both worlds"?
- We would like to use fewer bits earlier in the process, to get compression sooner
- We would like to use more bits later in the process, to get greater compression later in the file
- This is exactly what the Unix compress utility does: it starts out using 9 bits for codewords and adds an extra bit each time all codewords of the previous size have been used
  - 9 bits for codewords 0-511
  - 10 bits for codewords 512-1023
  - 11 bits for codewords 1024-2047
  - etc.
- The decompressor grows its codeword size by the same rule at the same points, so the two sides stay in sync (see the sketch at the end of this LZW discussion)

LZW Implementation Issues

What do we do when / if we run out of codewords?
- If we use a block code of a specific size, we have a finite number of codewords that we can represent
- Even the compress scheme eventually stops adding bits, because of the I/O issues we will discuss next
- When all codewords have been used, what do we do?

LZW Implementation Issues

Two primary options, each with advantages and disadvantages:
- Keep compressing as before, but simply stop adding new entries to the dictionary
  - Advantage: maintains the long patterns that have already been built up
  - Disadvantage: if the file content changes (introducing new patterns), those new patterns will not be compressed effectively
- Throw out the entire dictionary and start again with the single characters
  - Advantage: allows new patterns to be compressed
  - Disadvantage: until new patterns are built up and added to the dictionary, compression is minimal

LZW Implementation Issues

How do we do I/O with fractions of bytes?
- Unless we pick an exact multiple of a byte for our codeword size (8, 16, 24, 32 bits), we will need to input and output fractions of bytes
- We will not actually input / output fractions of bytes; rather, we keep a buffer and read / write whole bytes, pulling the necessary bits out of the buffer
- This requires some bit operations: shifting, bitwise OR, etc. (a small bit-packing sketch appears at the end of this LZW discussion)
- See lzw.c, and go to recitation!

LZW vs. Huffman

In practice, which is better, LZW or Huffman?
- For most files, LZW gives better compression ratios
- It is also generally better for compressing archived directories of files
- Why? Files can build up very long patterns, allowing LZW to get a great deal of compression, and different file types do not "hurt" each other with LZW as they do with Huffman: for each type we simply build up new patterns

LZW vs. Huffman

Let's compare; see the compare.txt handout
- For a single text file, Huffman does pretty well
- For a large archived file, Huffman does not do as well
- gzip outperforms both Huffman and LZW; it is a combination of LZ77 and Huffman
- See:
  http://en.wikipedia.org/wiki/Gzip
  http://www.gzip.org/
  http://www.gzip.org/algorithm.txt
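The following is a minimal sketch of the codeword-indexed decode table described above, assuming fixed 12-bit codewords. The names entry, table, init_table, and output_string are chosen here for illustration only; the course's actual lzw.c represents the same idea but differs in its details.

#include <stdio.h>

#define MAX_BITS   12
#define TABLE_SIZE (1 << MAX_BITS)   /* one slot per possible codeword */
#define NO_PREFIX  (-1)              /* marks a single-character entry */

/* Each dictionary entry is a (prefix codeword, last character) pair,
   mirroring the representation used during encoding. */
struct entry {
    int  prefix;   /* codeword for the string minus its last character */
    char last;     /* final character of the string */
};

static struct entry table[TABLE_SIZE];

/* Codewords 0-255 are the single characters themselves. */
void init_table(void)
{
    for (int c = 0; c < 256; c++) {
        table[c].prefix = NO_PREFIX;
        table[c].last   = (char)c;
    }
}

/* Recover the full string for a codeword by walking the prefix chain.
   The characters come out last-first, so they are collected in a buffer
   and written in reverse.  Returns the first character of the string. */
char output_string(int code, FILE *out)
{
    char buf[TABLE_SIZE];
    int  len = 0;

    while (table[code].prefix != NO_PREFIX) {
        buf[len++] = table[code].last;
        code = table[code].prefix;
    }
    buf[len++] = table[code].last;          /* the single-character root */

    for (int i = len - 1; i >= 0; i--)
        fputc(buf[i], out);

    return buf[len - 1];                    /* first character of the string */
}

The returned first character is what the decoder appends to the previously decoded string when it adds its own new (prefix, last character) entry, just as the encoder did.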
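Below is a small sketch of the compress-style codeword-width rule described above: 9 bits to start, one more bit each time all codewords of the current width have been handed out. The constants and the function update_width are illustrative assumptions, not code from lzw.c; exactly where the width bumps relative to the reads and writes is a detail the encoder and decoder must simply agree on so that they stay in sync.

/* Start at 9 bits and add a bit each time every codeword of the current
   width has been assigned (512 -> 10 bits, 1024 -> 11 bits, and so on),
   capped at LIMIT_BITS.  When the cap is hit the dictionary is simply
   frozen here; the "reset" option would clear it and rebuild from the
   single characters instead. */

#define START_BITS 9
#define LIMIT_BITS 16

static int code_bits = START_BITS;   /* current codeword width */
static int next_code = 256;          /* next free codeword (0-255 are single chars) */

/* Called each time a new dictionary entry is assigned a codeword.
   Returns the width to use for subsequent codewords. */
int update_width(void)
{
    if (next_code < (1 << LIMIT_BITS))
        next_code++;
    if (next_code >= (1 << code_bits) && code_bits < LIMIT_BITS)
        code_bits++;
    return code_bits;
}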
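This is a rough sketch of the buffered bit output mentioned above: codewords whose width is not a whole number of bytes are packed into a byte stream using shifts and bitwise OR. The names write_code and flush_bits are made up for this example; the course's lzw.c solves the same problem but not necessarily this way.

#include <stdio.h>

/* Pack codewords of arbitrary width into whole bytes for output.  Bits
   accumulate in an unsigned long; whenever at least 8 are available,
   one full byte is peeled off and written. */
static unsigned long bit_buffer = 0;   /* pending bits, oldest at the low end */
static int           bit_count  = 0;   /* number of valid bits in bit_buffer */

void write_code(unsigned int code, int nbits, FILE *out)
{
    /* OR the new code in above the bits already waiting in the buffer. */
    bit_buffer |= (unsigned long)code << bit_count;
    bit_count  += nbits;

    while (bit_count >= 8) {
        fputc((int)(bit_buffer & 0xFF), out);   /* emit the lowest 8 bits */
        bit_buffer >>= 8;
        bit_count   -= 8;
    }
}

/* At end of output, flush any leftover bits, padded with zeros. */
void flush_bits(FILE *out)
{
    if (bit_count > 0) {
        fputc((int)(bit_buffer & 0xFF), out);
        bit_buffer = 0;
        bit_count  = 0;
    }
}

Reading mirrors this: whole bytes are shifted into a buffer until at least nbits are available, and then the low nbits are masked off as the next codeword.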
Limitations on Compression and Entropy

How much can we compress a file (in a lossless way)?
- Ex: Take a large file and recursively compress it K times
  - If K is large enough, maybe I can compress the entire file down to 1 bit!
  - Of course this won't work, but why not?
  - Clearly, we could not unambiguously decompress such a file: an infinite number of "original" files could have produced it

Limitations on Compression and Entropy

Generally speaking, the amount we can compress a file depends on the amount of entropy in the data
- Informally, entropy is the amount of uncertainty / randomness in the data
- Information entropy is very similar in nature to thermodynamic entropy, which you may have discussed in a physics or chemistry class
- The more entropy, the less we can compress; the less entropy, the more we can compress (a small sketch for estimating this appears at the end of this section)
- Ex: A file containing all A's can be heavily compressed, since all of the characters are the same
- Ex: A file containing random bits cannot be compressed at all

Limitations on Compression and Entropy

When we compress a file (ex: using compress or gzip), we are taking patterns / repeated sequences and substituting codewords that have much more entropy
- Attempting to compress the result (even using a different algorithm) may give little or no further compression
- However, in some cases it may be worth trying, if the two algorithms are very different
- For more info, see:
  http://en.wikipedia.org/wiki/Lossless_data_compression
  http://en.wikipedia.org/wiki/Information_entropy

Some Mathematical Algorithms

Integer Multiplication
- With predefined int variables we think of multiplication as being Theta(1)
- Is the multiplication really a constant-time operation?
  - No: it is constant due to the constant size of the numbers (32 bits), not due to the algorithm
- What if we need to multiply very large ints?
  - Ex: RSA, which we will see shortly, needs ints of sizes up to 2048 bits
- Now we need to think about good integer multiplication algorithms
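As a concrete illustration of the point above, here is a minimal schoolbook multiplication sketch for large integers stored as arrays of decimal digits, least significant digit first. The function big_multiply is a made-up name for this example; it is the naive Theta(n^2) method, the baseline that better big-integer algorithms improve on.

/* Schoolbook multiplication of large integers stored as arrays of decimal
   digits, least significant digit first.  For n-digit operands this does
   Theta(n^2) digit multiplications, which is the cost hidden when we treat
   built-in int multiplication as Theta(1). */
void big_multiply(const int *a, int alen, const int *b, int blen, int *result)
{
    /* result must have room for alen + blen digits */
    for (int i = 0; i < alen + blen; i++)
        result[i] = 0;

    for (int i = 0; i < alen; i++) {
        int carry = 0;
        for (int j = 0; j < blen; j++) {
            int t = result[i + j] + a[i] * b[j] + carry;
            result[i + j] = t % 10;
            carry         = t / 10;
        }
        result[i + blen] += carry;
    }
}

For example, 123 * 45 would be passed as a = {3, 2, 1} and b = {5, 4}; result comes back as {5, 3, 5, 5, 0}, i.e. 5535.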
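Returning to the entropy limits discussed above, the following sketch estimates the Shannon entropy of a file's byte distribution, H = -sum of p_i * log2(p_i) bits per byte. The name byte_entropy is made up for this example, and the estimate deliberately ignores repeated multi-byte patterns (exactly what LZW exploits), so it only bounds how far per-character coding such as Huffman can go.

#include <stdio.h>
#include <math.h>

/* Estimate the Shannon entropy of a file's byte distribution:
       H = -sum over i of p_i * log2(p_i)    (bits per byte)
   A file of identical bytes gives H = 0 (highly compressible); uniformly
   random bytes give H close to 8 (essentially incompressible). */
double byte_entropy(FILE *in)
{
    long count[256] = {0};
    long total = 0;
    int  c;

    while ((c = fgetc(in)) != EOF) {
        count[c]++;
        total++;
    }
    if (total == 0)
        return 0.0;

    double h = 0.0;
    for (int i = 0; i < 256; i++) {
        if (count[i] > 0) {
            double p = (double)count[i] / (double)total;
            h -= p * log2(p);
        }
    }
    return h;   /* multiply by the byte count for a rough size bound in bits */
}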