These notes are intended for use by students in cs1501 at the University of Pittsburgh and no one else




Boyer Moore

  #include <string.h>

  int skip[256];  /* skip[c]: shift distance when the mismatched text character is c */

  /* Build the mismatched-character ("skip") table: a character not
     in the pattern shifts by M; otherwise by the distance from its
     rightmost occurrence in the pattern to the end of the pattern. */
  void initskip(char *p)
  {
      int j, M = strlen(p);
      for (j = 0; j < 256; j++) skip[j] = M;
      for (j = 0; j < M; j++) skip[(unsigned char)p[j]] = M-1-j;
  }

  int mischarsearch(char *p, char *a)
  {
      int i, j, t, M = strlen(p), N = strlen(a);
      initskip(p);
      for (i = M-1, j = M-1; j >= 0; i--, j--)
          while (a[i] != p[j])
          {
              t = skip[(unsigned char)a[i]];
              /* if we have passed more chars (right to left) than t,
                 skip that amount rather than t */
              i += (M-j > t) ? M-j : t;
              if (i >= N) return N;  /* shifted past the end: no match */
              j = M-1;               /* restart at the right end of the pattern */
          }
      return i+1;  /* index of the leftmost match */
  }
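
For example, a quick driver for the routine above (the text and patterns here are chosen only for illustration):

  #include <stdio.h>

  int main(void)
  {
      char *text = "A STRING SEARCHING EXAMPLE";
      printf("%d\n", mischarsearch("SEARCH", text));  /* prints 9: index of leftmost match */
      printf("%d\n", mischarsearch("QUERY", text));   /* prints 26 (= N): not found */
      return 0;
  }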

Boyer Moore

    • Can the mismatched character (MC) heuristic ever be poor?
      • Yes
      • Ex: search for P = XAAA in T = AAAA...A – skip[A] is 0, so every alignment compares all M characters right to left, mismatches on the X, and then shifts by only one position (see the demo below)
      • By itself the runtime could be Theta(NM) – the same as the worst case of the brute force algorithm
    • This is why the BM algorithm has two heuristics
      • The second (match) heuristic guarantees that the run-time will never be worse than linear
    • Look at comparison table
      • Discuss
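
A small instrumented demo of that worst case. This is a hypothetical driver (it reuses skip and initskip from the listing above, and copies mischarsearch with a comparison counter added):

  #include <stdio.h>
  #include <string.h>

  static long cmps;  /* character comparisons performed */

  /* Same search as mischarsearch above, but counting every
     character comparison made by the while condition. */
  static int counted_search(char *p, char *a)
  {
      int i, j, t, M = strlen(p), N = strlen(a);
      initskip(p);
      for (i = M-1, j = M-1; j >= 0; i--, j--)
          while (++cmps, a[i] != p[j])
          {
              t = skip[(unsigned char)a[i]];
              i += (M-j > t) ? M-j : t;
              if (i >= N) return N;
              j = M-1;
          }
      return i+1;
  }

  int main(void)
  {
      char text[1001];
      memset(text, 'A', 1000);        /* T = 1000 A's */
      text[1000] = '\0';
      counted_search("XAAA", text);   /* P = XAAA, never found */
      /* prints roughly N*M = 4000 comparisons, not the ~N/M of the good case */
      printf("%ld comparisons for N = 1000, M = 4\n", cmps);
      return 0;
  }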

Compression

  • Why do we use compression?
    • To save space
      • Hard drives, despite increasing size, always seem to be filled with new programs and data files
      • 3.5" floppy drives are still used and are still very low capacity
    • To save time/bandwidth
      • Many programs and files are now downloaded from the internet
      • Most people still have relatively slow connections
      • Compressed files allow for faster transfer times, and allow more people to use a server at the same time

Lossy Compression

    • Lossy – the original file is not exactly reproduced; some data is deliberately discarded
      • Good for audio and video applications, where quality is judged by the user's perception
      • Gives extremely large amounts of compression, which is useful for large audio and video files
      • If the quality is degraded somewhat, the user may not notice or may not care
      • Many sophisticated algorithms are used to determine which data to "lose" so that losing it degrades the overall quality of the file the least

Lossless Compression

    • Lossless – the original file is exactly reproduced when the compressed file is decompressed
      • That is, D(C(X)) = X, where C is the compression function and D is the decompression function
      • Necessary for files that must be exactly restored
        • Ex: .txt, .exe, .doc, .xls, .java, .cpp and others
      • Will also work for audio and video, but will not achieve the same compression ratios as lossy techniques
        • However, since it works for all file types, it can be used effectively for archiving all of the data in a directory or in multiple directories

Lossless Compression

  • Most modern lossless compression techniques have their roots in three algorithms:
    • Huffman
      • A variable-length code with one codeword per character
    • LZ77
      • Uses a "sliding window" to compress groups of characters at a time
    • LZ78 / LZW
      • Uses a dictionary to store previously seen patterns, also compressing groups of characters at a time
  • We will discuss Huffman and LZW in more detail

Lossless Compression

  • What is used in actual compression programs?
    • Here are a few
      • Unix compress, GIF: LZW
      • pkzip, zip, gzip: LZ77 + Huffman

Huffman Compression

  • Background:
    • Huffman works with arbitrary bytes, but the ideas are most easily explained using character data, so we will discuss it in those terms
    • Consider extended ASCII character set:
      • 8 bits per character
      • BLOCK code, since all codewords are the same length
        • 8 bits yield 256 characters
        • In general, block codes give:
          • For K bits, 2^K characters
          • For N characters, ceiling(log2 N) bits are required
      • Easy to encode and decode

Huffman Compression

    • What if we could use variable-length codewords? Could we do better than ASCII?
      • The idea is that different characters would use different numbers of bits
      • If all characters occur with the same frequency, we cannot improve over ASCII
    • What if characters had different frequencies of occurrence?
      • Ex: In English text, letters like E, A, I, S appear much more frequently than letters like Q, Z, X
      • Can we somehow take advantage of these differences in our encoding?

Huffman Compression

    • First we need to make sure that variable-length coding is feasible
      • Decoding a block code is easy – just take the next 8 bits
      • Decoding a variable-length code is not so obvious, since we do not know in advance how many bits the next codeword uses
      • In order to decode unambiguously, variable-length codes must meet the prefix property: no codeword may be a prefix of any other codeword (see the decoding sketch below)
    • Ok, so now how do we compress?
      • Let's use fewer bits for our more common characters, and more bits for our less common characters
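
A minimal sketch of why the prefix property makes unambiguous decoding possible. The code table here is made up for illustration (it is not a real Huffman code); because no codeword is a prefix of another, a greedy left-to-right scan always finds exactly one matching codeword:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* hypothetical prefix-free code: E=0, A=10, Q=110, Z=111 */
      const char *code[] = { "0", "10", "110", "111" };
      const char  sym[]  = { 'E', 'A', 'Q', 'Z' };
      const char *bits   = "0101100111";     /* encodes E A Q E Z */

      while (*bits) {
          int k;
          for (k = 0; k < 4; k++) {          /* find the unique match */
              size_t len = strlen(code[k]);
              if (strncmp(bits, code[k], len) == 0) {
                  putchar(sym[k]);
                  bits += len;               /* consume the codeword */
                  break;
              }
          }
      }
      putchar('\n');                         /* prints EAQEZ */
      return 0;
  }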

Huffman Compression

  • Huffman Algorithm:
    • Assume we have K characters and that each uncompressed character has some weight associated with it (i.e. its frequency of occurrence)
    • Initialize a forest, F, to have K single-node trees in it, one tree per character, also storing the character's weight
    • while (|F| > 1)
      • Find the two trees, T1 and T2, with the smallest weights
      • Create a new tree, T, whose weight is the sum of the weights of T1 and T2
      • Remove T1 and T2 from F, making them the left and right children of T
      • Add T to F
    • A sketch of this loop in C follows
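
A minimal sketch of the forest-merging loop in C. The names here (node, huffman, lightest) and the array-based forest are ours, for illustration only; a real implementation would keep F in a priority queue so that finding the two smallest trees is O(log K) rather than the linear scan used here:

  #include <stdlib.h>

  /* One tree in the forest: leaves hold a character,
     internal nodes hold only a combined weight. */
  typedef struct node {
      int ch;                    /* character, or -1 for internal */
      long weight;
      struct node *left, *right;
  } node;

  /* Index of the lightest tree in forest[0..n-1]. */
  static int lightest(node **forest, int n) {
      int i, best = 0;
      for (i = 1; i < n; i++)
          if (forest[i]->weight < forest[best]->weight) best = i;
      return best;
  }

  /* Build the Huffman tree for K characters with the given
     weights; returns the root of the single remaining tree. */
  node *huffman(int K, const int chars[], const long weights[]) {
      node **forest = malloc(K * sizeof *forest);
      int i, n = K;
      for (i = 0; i < K; i++) {             /* K single-node trees */
          node *t = malloc(sizeof *t);
          t->ch = chars[i]; t->weight = weights[i];
          t->left = t->right = NULL;
          forest[i] = t;
      }
      while (n > 1) {                       /* while |F| > 1 */
          int a = lightest(forest, n);      /* T1: smallest weight */
          node *T1 = forest[a];
          forest[a] = forest[--n];          /* remove T1 from F */
          int b = lightest(forest, n);      /* T2: next smallest */
          node *T2 = forest[b];
          node *T = malloc(sizeof *T);      /* new combined tree T */
          T->ch = -1;
          T->weight = T1->weight + T2->weight;
          T->left = T1; T->right = T2;      /* T1, T2 become children */
          forest[b] = T;                    /* T replaces T2 in F */
      }
      node *root = forest[0];
      free(forest);
      return root;
  }

The codeword for each character is then read off the finished tree: label each left edge 0 and each right edge 1, and a character's codeword is the sequence of labels on the path from the root to its leaf.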
