Machine Learning and the Axioms of Probability

  • Brigham S. Anderson
  • www.cs.cmu.edu/~brigham
  • brigham@cmu.edu
  • School of Computer Science
  • Carnegie Mellon University

Probability

  • The world is a very uncertain place
  • 30 years of Artificial Intelligence and Database research danced around this fact
  • And then a few AI researchers decided to use some ideas from the eighteenth century

What we’re going to do

  • We will review the fundamentals of probability.
  • It’s really going to be worth it
  • You’ll see examples of probabilistic analytics in action:
    • Inference,
    • Anomaly detection, and
    • Bayes Classifiers

Discrete Random Variables

  • A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
  • Examples
    • A = The US president in 2023 will be male
    • A = You wake up tomorrow with a headache
    • A = You have influenza

Probabilities

  • We write P(A) as “the probability that A is true”
  • We could at this point spend 2 hours on the philosophy of this.
  • We’ll spend slightly less...

Sample Space

  • Definition 1.
  • The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment
  • The elements of the sample space are called outcomes.

Sample Spaces

  • Sample space of a coin flip:
  • S = {H, T}

Sample Spaces

  • Sample space of a die roll:
  • S = {1, 2, 3, 4, 5, 6}

Sample Spaces

  • Sample space of three die rolls?
  • S = {111, 112, 113, …, 664, 665, 666}

Sample Spaces

  • Sample space of a single draw from a deck of cards:
  • S = {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}

So Far…

  • The sample space is the set of all possible worlds. Example: {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}
  • An outcome is an element of the sample space. Example: 2c

Events

  • Definition 2.
  • An event is any subset of S (including S itself)

Events

  • Event: “Jack” (a subset of the card-draw sample space)
  • The Sample Space is the set of all outcomes.
  • An Outcome is a possible world.
  • An Event is a set of outcomes.

Events

  • Event: “Hearts” (another subset of the card-draw sample space)

Events

  • Event: “Red and Face” (the set of outcomes that are both red cards and face cards)

Definitions

  • The sample space is the set of all possible worlds. Example: {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}
  • An outcome is a single point in the sample space. Example: 2c
  • An event is a set of outcomes from the sample space. Example: {2h, 2c, 2s, 2d}

Events

  • Definition 3.
  • Two events A and B are mutually exclusive if A ∩ B = Ø.
  • Definition 4.
  • If A1, A2, … are mutually exclusive and A1 ∪ A2 ∪ … = S, then the collection A1, A2, … forms a partition of S.
  • Example: the four suits (clubs, hearts, spades, diamonds) form a partition of the card-draw sample space.

Probability

  • Definition 5.
  • Given a sample space S, a probability function is a function that maps each event in S to a real number and satisfies:
    • P(A) ≥ 0 for any event A in S
    • P(S) = 1
    • For any number of mutually exclusive events A1, A2, A3, …, we have P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …
  • This definition of the domain of this function is not 100% sufficient, but it’s close enough for our purposes… (I’m sparing you Borel Fields)

Definitions

  • The sample space is the set of all possible worlds. Example: {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}
  • An outcome is a single point in the sample space. Example: 4c
  • An event is a set of one or more outcomes. Example: card is “Red”
  • P(E) maps event E to a real number and satisfies the axioms of probability. Example: P(Red) = 0.50, P(Black) = 0.50

Misconception

  • “The relative area of the events determines their probability.”
      • …in a Venn diagram (picture A and ~A as two regions) it does, but not in general.
      • However, the “area equals probability” rule is guaranteed to result in axiom-satisfying probability functions.
  • We often assume, for example, that the probability of “heads” equals the probability of “tails” in the absence of other information…
  • But this is totally outside the axioms!

Creating a Valid P()

  • One convenient way to create an axiom-satisfying probability function:
      • Assign a probability to each outcome in S
      • Make sure they sum to one
      • Declare that P(A) equals the sum of outcomes in event A
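
A minimal sketch of this recipe (mine, not from the slides), using the equally likely card-draw sample space from earlier; the helper name P is made up for the example:

```python
from itertools import product

# Sample space of a single card draw: 52 outcomes.
ranks = "A 2 3 4 5 6 7 8 9 10 J Q K".split()
suits = ["s", "c", "h", "d"]
S = [r + s for r, s in product(ranks, suits)]

# Steps 1 and 2: assign a probability to each outcome; make sure they sum to one.
p_outcome = {o: 1.0 / len(S) for o in S}
assert abs(sum(p_outcome.values()) - 1.0) < 1e-12

# Step 3: declare that P(A) is the sum of the outcome probabilities in event A.
def P(event):
    return sum(p_outcome[o] for o in event)

red = {o for o in S if o.endswith(("h", "d"))}  # event: the card is red
print(P(red))   # 0.5
print(P(S))     # 1.0, as the axiom P(S) = 1 requires
```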

Everyday Example

  • Assume you are a doctor.
  • This is the sample space of “patients you might see on any given day”.
  • Example outcomes: “Non-smoker, female, diabetic, headache, good insurance, etc…” and “Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc…”

Everyday Example

  • Number of elements in the “patient space”:
  • 100 jillion
  • Are these patients equally likely to occur?
  • Again, generally not. Let’s assume for the moment that they are, though.
  • …which roughly means “area equals probability”

Everyday Example

  • Event F: Patient has Flu
  • Size of set “F”: 2 jillion (exactly 2 jillion of the points in the sample space have flu)
  • Size of “patient space”: 100 jillion
  • P_patientSpace(F) = 2 jillion / 100 jillion = 0.02

Everyday Example

  • P_patientSpace(F) = 0.02
  • From now on, the subscript on P() will be omitted…

These Axioms are Not to be Trifled With

  • There have been attempts to develop other methodologies for handling uncertainty
      • Fuzzy Logic
      • Three-valued logic
      • Dempster-Shafer
      • Non-monotonic reasoning
  • But the axioms of probability are the only system with this property:
  • If you gamble using them you can’t be unfairly exploited by an opponent using some other system [de Finetti 1931]

Theorems from the Axioms

  • Axioms
    • P(A) ≥ 0 for any event A in S
    • P(S) = 1
    • For any number of mutually exclusive events A1, A2, A3, …, we have P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …
  • Theorem.
  • If P is a probability function and A is an event in S, then P(~A) = 1 − P(A)
  • Proof:
  • (1) Since A and ~A partition S, P(A ∪ ~A) = P(S) = 1
  • (2) Since A and ~A are disjoint, P(A ∪ ~A) = P(A) + P(~A)
  • Combining (1) and (2) gives the result

Multivalued Random Variables

  • Suppose A can take on more than 2 values
  • A is a random variable with arity k if it can take on exactly one value out of {A1, A2, …, Ak}
  • The events {A1, A2, …, Ak} partition S, so P(A1) + P(A2) + … + P(Ak) = 1

Elementary Probability in Pictures

  • P(~A) + P(A) = 1 (A and ~A partition the sample space)

Elementary Probability in Pictures

  • P(B) = P(B, A) + P(B, ~A)

Elementary Probability in Pictures

  • If A1, A2, A3 partition S, then P(A1) + P(A2) + P(A3) = 1

Elementary Probability in Pictures

  • If A1, A2, A3 partition S, then P(B) = P(B, A1) + P(B, A2) + P(B, A3)
  • Useful!

Conditional Probability

  • Assume once more that you are a doctor.
  • Again, this is the sample space of “patients you might see on any given day”.

Conditional Probability

  • Event F: Flu
  • P(F) = 0.02

Conditional Probability

  • Event H: Headache
  • P(H) = 0.10

Conditional Probability

  • P(F) = 0.02
  • P(H) = 0.10
  • …we still need to specify the interaction between flu and headache…
  • Define P(H|F) = the fraction of F’s outcomes that are also in H

Conditional Probability

  • H = “headache”, F = “flu”
  • P(F) = 0.02
  • P(H) = 0.10
  • P(H|F) = 0.50
  • Region probabilities in the Venn diagram: P(H, F) = 0.01, P(~H, F) = 0.01, P(H, ~F) = 0.09, P(~H, ~F) = 0.89

Conditional Probability

  • H = “headache”, F = “flu”
  • P(H|F) = fraction of flu worlds in which the patient has a headache
  •        = (# worlds with flu and headache) / (# worlds with flu)
  •        = (size of the “H and F” region) / (size of the “F” region)
  •        = P(H, F) / P(F)
  •        = 0.01 / 0.02 = 0.50

Conditional Probability

  • Definition.
  • If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is P(A|B) = P(A, B) / P(B)
  • The Chain Rule
  • A simple rearrangement of the above equation yields P(A, B) = P(A|B) P(B)
  • Main Bayes Net concept!

Probabilistic Inference

  • H = “Have a headache”
  • F = “Coming down with Flu”
  • P(H) = 0.10
  • P(F) = 0.02
  • P(H|F) = 0.50
  • One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu”
  • Is this reasoning good?

Probabilistic Inference

  • H = “Have a headache”
  • F = “Coming down with Flu”
  • P(H) = 0.10
  • P(F) = 0.02
  • P(H|F) = 0.50
  • No: the reasoning confuses P(H|F) with P(F|H).
  • P(F|H) = P(F, H) / P(H) = 0.01 / 0.10 = 0.10

What we just did…

  • P(B|A) = P(A, B) / P(A) = P(A|B) P(B) / P(A)
  • This is Bayes Rule
  • Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418.
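
A short sketch (not part of the slides) applying Bayes rule to the headache/flu numbers used above:

```python
# Bayes rule on the headache/flu numbers from the preceding slides.
P_F = 0.02          # P(flu)
P_H = 0.10          # P(headache)
P_H_given_F = 0.50  # P(headache | flu)

P_F_given_H = P_H_given_F * P_F / P_H   # P(F|H) = P(H|F) P(F) / P(H)
print(P_F_given_H)                      # 0.1, far from the tempting "50-50" guess
```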

More General Forms of Bayes Rule

  • Conditioning on additional evidence X: P(A|B, X) = P(B|A, X) P(A|X) / P(B|X)
  • For a partition A1, A2, …, Ak: P(Ai|B) = P(B|Ai) P(Ai) / [P(B|A1) P(A1) + … + P(B|Ak) P(Ak)]

Independence

  • Definition.
  • Two events, A and B, are statistically independent if P(A, B) = P(A) P(B)
  • Which is equivalent to P(A|B) = P(A) (when P(B) > 0)
  • Important for Bayes Nets!

Representing P(A,B,C)

  • How can we represent the function P(A)?
  • P(A,B)?
  • P(A,B,C)?

The Joint Probability Table

  • Recipe for making a joint distribution of M variables:
  • Make a truth table listing all combinations of values of your variables (if there are M boolean variables then the table will have 2^M rows).
  • For each combination of values, say how probable it is.
  • If you subscribe to the axioms of probability, those numbers must sum to 1.
  • Example: P(A, B, C)

      A  B  C  |  Prob
      0  0  0  |  0.30
      0  0  1  |  0.05
      0  1  0  |  0.10
      0  1  1  |  0.05
      1  0  0  |  0.05
      1  0  1  |  0.10
      1  1  0  |  0.25
      1  1  1  |  0.10

  • (The slide also shows the same eight probabilities as regions of a three-set Venn diagram over A, B, and C.)

Using the Joint

  • Once you have the JPT you can ask for the probability of any logical expression
  • …what is P(Poor,Male)?

Using the Joint

  • P(Poor, Male) = 0.4654
  • …what is P(Poor)?

Using the Joint

  • P(Poor) = 0.7604
  • …what is P(Poor|Male)?

Inference with the Joint

  • P(Male | Poor) = P(Poor, Male) / P(Poor) = 0.4654 / 0.7604 = 0.612

Inference is a big deal

  • I’ve got this evidence. What’s the chance that this conclusion is true?
    • I’ve got a sore neck: how likely am I to have meningitis?
    • I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep?
  • There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis

Where do Joint Distributions come from?

  • Idea One: Expert Humans
  • Idea Two: Simpler probabilistic facts and some algebra
  • Example: Suppose you knew
  • P(A) = 0.5
  • P(B|A) = 0.2
  • P(B|~A) = 0.1
  • P(C|A,B) = 0.1
  • P(C|A,~B) = 0.8
  • P(C|~A,B) = 0.3
  • P(C|~A,~B) = 0.1
  • Then you can automatically compute the JPT using the chain rule
  • P(A,B,C) = P(A) P(B|A) P(C|A,B)
  • Bayes Nets are a systematic way to do this.
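
A small sketch (mine, not the slides') of that computation, multiplying the given factors over all eight truth assignments:

```python
from itertools import product

# The simpler probabilistic facts from this slide.
P_A = 0.5
P_B_given_A = {1: 0.2, 0: 0.1}                 # keyed by the value of A
P_C_given_AB = {(1, 1): 0.1, (1, 0): 0.8,
                (0, 1): 0.3, (0, 0): 0.1}      # keyed by (A, B)

# Chain rule: P(A, B, C) = P(A) * P(B | A) * P(C | A, B)
jpt = {}
for a, b, c in product((0, 1), repeat=3):
    p_a = P_A if a else 1 - P_A
    p_b = P_B_given_A[a] if b else 1 - P_B_given_A[a]
    p_c = P_C_given_AB[(a, b)] if c else 1 - P_C_given_AB[(a, b)]
    jpt[(a, b, c)] = p_a * p_b * p_c

print(sum(jpt.values()))   # 1.0 (up to floating point): a valid probability function
print(jpt[(1, 1, 1)])      # 0.5 * 0.2 * 0.1 = 0.01
```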

Where do Joint Distributions come from?

  • Idea Three: Learn them from data!
      • Prepare to witness an impressive learning algorithm….

Learning a JPT

  • Build the same truth table as before, then fill in each row with
  • P(row) = (# records matching that row) / (total # of records)
  • For example, the entry for A = 1, B = 1, C = 0 is the fraction of all records in which A and B are True but C is False.

      A  B  C  |  Prob
      0  0  0  |  0.30
      0  0  1  |  0.05
      0  1  0  |  0.10
      0  1  1  |  0.05
      1  0  0  |  0.05
      1  0  1  |  0.10
      1  1  0  |  0.25
      1  1  1  |  0.10
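
A minimal counting sketch of this learning rule; the records list here is a made-up stand-in for a real dataset:

```python
from collections import Counter
from itertools import product

# Hypothetical (A, B, C) records; in practice these come from your dataset.
records = [(1, 1, 0), (0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 1)]

# Each JPT row gets the fraction of records with exactly that combination of values.
counts = Counter(records)
jpt = {row: counts[row] / len(records) for row in product((0, 1), repeat=3)}

print(jpt[(1, 1, 0)])   # 2/6: fraction of records in which A and B are true but C is false
```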

Example of Learning a JPT

  • This JPT was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]

Where are we?

  • We have recalled the fundamentals of probability
  • We have become content with what JPTs are and how to use them
  • And we even know how to learn JPTs from data.

Density Estimation

  • Our Joint Probability Table (JPT) learner is our first example of something called Density Estimation
  • A Density Estimator learns a mapping from a set of attributes to a probability:
  • Input Attributes → Density Estimator → Probability

Evaluating a density estimator

  • Given a record x, a density estimator M can tell you how likely the record is: P(x|M)
  • Given a dataset with R records, a density estimator can tell you how likely the dataset is: P(dataset|M) = P(x1 ∧ x2 ∧ … ∧ xR | M) = P(x1|M) P(x2|M) … P(xR|M)
    • (Under the assumption that all records were independently generated from the probability function)

A small dataset: Miles Per Gallon

  • From the UCI repository (thanks to Ross Quinlan)

A small dataset: Miles Per Gallon

  • 192 Training Set Records

A small dataset: Miles Per Gallon

  • 192 Training Set Records

Log Probabilities

  • Since probabilities of datasets get so small we usually use log probabilities: log P(dataset|M) = log P(x1|M) + log P(x2|M) + … + log P(xR|M)
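
A sketch of how such a score could be computed for a JPT learned as above; log_likelihood is a hypothetical helper, and the early return for zero-probability rows anticipates the overfitting problem discussed below:

```python
import math

def log_likelihood(jpt, dataset):
    """log P(dataset | M) = sum of log P(record | M) over the records."""
    total = 0.0
    for record in dataset:
        p = jpt.get(record, 0.0)
        if p == 0.0:
            return float("-inf")   # one "impossible" record sinks the whole dataset
        total += math.log(p)
    return total

# e.g. log_likelihood(jpt, [(1, 1, 0), (0, 0, 0)]) with the JPT learned above
```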

A small dataset: Miles Per Gallon

  • 192 Training Set Records

Summary: The Good News

  • The JPT allows us to learn P(X) from data.
    • Can do inference: P(E1|E2)
          • Automatic Doctor, Recommender, etc
    • Can do anomaly detection
      • spot suspicious / incorrect records
      • (e.g., credit card fraud)
    • Can do Bayesian classification
      • Predict the class of a record
      • (e.g., predict cancerous/not-cancerous)

Summary: The Bad News

  • Density estimation with JPTs is trivial, mindless and dangerous

Using a test set

  • An independent test set with 196 cars has a much worse log likelihood than the training set did
  • (actually it’s a billion quintillion quintillion quintillion quintillion times less likely)
  • ….Density estimators can overfit. And the JPT estimator is the overfittiest of them all!

Overfitting Density Estimators

  • If any test record is assigned probability zero, the test-set log likelihood is −∞; it means there are certain combinations that we learned are “impossible”

Using a test set

  • The only reason that our test set didn’t score −∞ is that Andrew’s code is hard-wired to always predict a probability of at least one in 10^20
  • We need Density Estimators that are less prone to overfitting

Is there a better way?

  • The problem with the JPT is that it just mirrors the training data.
      • In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from it!
  • We need to represent the probability function with fewer parameters…

Aside: Bayes Nets

Bayes Nets

  • What are they?
    • Bayesian nets are a framework for representing and analyzing models involving uncertainty
  • What are they used for?
    • Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining
  • How are they different from other knowledge representation and probabilistic analysis tools?
    • Uncertainty is handled in a mathematically rigorous yet efficient and simple way

Bayes Net Concepts

    • Chain Rule
      • P(A,B) = P(A) P(B|A)
    • Conditional Independence
      • P(A|B,C) = P(A|B)

A Simple Bayes Net

  • Let’s assume that we already have P(Mpg, Horse):
    • P(good, low) = 0.36
    • P(good, high) = 0.04
    • P( bad, low) = 0.12
    • P( bad, high) = 0.48
  • How would you rewrite this using the Chain rule?

Review: Chain Rule

  • 0.48
  • 0.12
  • bad
  • 0.04
  • 0.36
  • good
  • high
  • low
  • P(Mpg, Horse)
  • P(good, low) = 0.36
  • P(good,high) = 0.04
  • P( bad, low) = 0.12
  • P( bad,high) = 0.48
  • P(Mpg, Horse)
  • P(good) = 0.4
  • P( bad) = 0.6
  • P( low|good) = 0.89
  • P( low| bad) = 0.21
  • P(high|good) = 0.11
  • P(high| bad) = 0.79
  • P(Mpg)
  • P(Horse|Mpg)
  • *

Review: Chain Rule

  • 0.48
  • 0.12
  • bad
  • 0.04
  • 0.36
  • good
  • high
  • low
  • P(Mpg, Horse)
  • P(good, low) = 0.36
  • P(good,high) = 0.04
  • P( bad, low) = 0.12
  • P( bad,high) = 0.48
  • P(Mpg, Horse)
  • P(good) = 0.4
  • P( bad) = 0.6
  • P( low|good) = 0.89
  • P( low| bad) = 0.21
  • P(high|good) = 0.11
  • P(high| bad) = 0.79
  • P(Mpg)
  • P(Horse|Mpg)
  • *
  • = P(good) * P(low|good) = 0.4 * 0.89
  • = P(good) * P(high|good) = 0.4 * 0.11
  • = P(bad) * P(low|bad)
  • = 0.6 * 0.21
  • = P(bad) * P(high|bad)
  • = 0.6 * 0.79

How to Make a Bayes Net

      • P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
  • Network: Mpg → Horse

How to Make a Bayes Net

      • P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
  • Network: Mpg → Horse
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79

How to Make a Bayes Net

  • Network: Mpg → Horse
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79
  • Each node is a probability function
  • Each arc denotes conditional dependence
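
As an illustration (not from the slides), the two-node net can be stored as two plain CPT dictionaries, with the joint recovered by one lookup per node:

```python
# The two-node net above as plain dictionaries: one CPT per node.
P_Mpg = {"good": 0.4, "bad": 0.6}
P_Horse_given_Mpg = {("low", "good"): 0.90, ("high", "good"): 0.10,
                     ("low", "bad"): 0.21, ("high", "bad"): 0.79}

def joint(mpg, horse):
    """P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)."""
    return P_Mpg[mpg] * P_Horse_given_Mpg[(horse, mpg)]

print(joint("good", "low"))   # 0.4 * 0.90 = 0.36
print(joint("bad", "high"))   # 0.6 * 0.79 = 0.474 (the table's 0.48, up to rounded CPTs)
```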

How to Make a Bayes Net

  • So, what have we accomplished thus far?
  • Nothing; we’ve just “Bayes Net-ified” the P(Mpg, Horse) JPT using the Chain rule.
  • …the real excitement starts when we wield conditional independence

How to Make a Bayes Net

  • Before we continue, we need a worthier opponent than puny P(Mpg, Horse)…
  • We’ll use P(Mpg, Horse, Accel):
  • P(good, low,slow) = 0.37
  • P(good, low,fast) = 0.01
  • P(good,high,slow) = 0.02
  • P(good,high,fast) = 0.00
  • P( bad, low,slow) = 0.10
  • P( bad, low,fast) = 0.12
  • P( bad,high,slow) = 0.16
  • P( bad,high,fast) = 0.22
  • P(Mpg,Horse,Accel)
  • * Note: I made these up…

How to Make a Bayes Net

  • Step 1: Rewrite joint using the Chain rule.
      • P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
  • Note:
  • Obviously, we could have written this 3!=6 different ways…
      • P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
      • = P(M) * P(A|M) * P(H|M,A)
      • = P(H) * P(M|H) * P(A|H,M)
      • = P(H) * P(A|H) * P(M|H,A)
      • = …
      • = …

How to Make a Bayes Net

  • Step 1: Rewrite joint using the Chain rule.
      • P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)

How to Make a Bayes Net

  • The corresponding network has arcs Mpg → Horse, Mpg → Accel, and Horse → Accel, with the CPTs P(Mpg), P(Horse|Mpg), and P(Accel|Mpg,Horse) attached to the nodes.

How to Make a Bayes Net

  • Network: Mpg → Horse, Mpg → Accel, Horse → Accel
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79
  • P(Accel|Mpg,Horse):
    • P(slow|good, low) = 0.97
    • P(slow|good, high) = 0.15
    • P(slow| bad, low) = 0.90
    • P(slow| bad, high) = 0.05
    • P(fast|good, low) = 0.03
    • P(fast|good, high) = 0.85
    • P(fast| bad, low) = 0.10
    • P(fast| bad, high) = 0.95
  • * Note: I made these up too…

How to Make a Bayes Net

  • Network: Mpg → Horse, Mpg → Accel, Horse → Accel
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  • P(Accel|Mpg,Horse): P(slow|good,low) = 0.97, P(slow|good,high) = 0.15, P(slow|bad,low) = 0.90, P(slow|bad,high) = 0.05, P(fast|good,low) = 0.03, P(fast|good,high) = 0.85, P(fast|bad,low) = 0.10, P(fast|bad,high) = 0.95
  • A Miracle Occurs!
  • You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!
  • i.e., P(Accel | Mpg, Horse) = P(Accel | Horse)

How to Make a Bayes Net

  • Network: Mpg → Horse → Accel (the Mpg → Accel arc is gone)
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  • P(Accel|Horse): P(slow|low) = 0.22, P(slow|high) = 0.64, P(fast|low) = 0.78, P(fast|high) = 0.36

How to Make a Bayes Net

  • Network: Mpg → Horse → Accel
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  • P(Accel|Horse): P(slow|low) = 0.22, P(slow|high) = 0.64, P(fast|low) = 0.78, P(fast|high) = 0.36
  • Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!

Independence

  • “The Acceleration does not depend on the Mpg once I know the Horsepower.”
  • This can be specified very simply:
        • P(Accel Mpg, Horse) = P(Accel | Horse)
  • This is a powerful statement!
  • It required extra domain knowledge. A different kind of knowledge than numerical probabilities. It needed an understanding of causation.

Bayes Nets Formalized

  • A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
    • V is a set of vertices.
    • E is a set of directed edges joining vertices. No loops of any length are allowed.
  • Each vertex in V contains the following information:
    • A Conditional Probability Table (CPT) indicating how this variable’s probabilities depend on all possible combinations of parental values.

Bayes Nets Summary

  • Bayes nets are a factorization of the full JPT which uses the chain rule and conditional independence.
  • They can do everything a JPT can do (like quick, cheap lookups of probabilities)

The good news

  • We can do inference.
  • We can compute any conditional probability:
  • P( Some variable | Some other variable values )

The good news

  • We can do inference.
  • We can compute any conditional probability:
  • P( Some variable | Some other variable values )
  • Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables.
  • How much work is the above computation?

The sad, bad news

  • Doing inference “JPT-style” by enumerating all matching entries in the joint is expensive:
  • Exponential in the number of variables.
  • But perhaps there are faster ways of querying Bayes nets?
  • In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time.
  • So we’ve just got to program our computer to do those tricks too, right?
  • Sadder and worse news:
  • General querying of Bayes nets is NP-complete.

Case Study I

  • Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
  • Diagnostic system for lymph-node diseases.
  • 60 diseases and 100 symptoms and test-results.
  • 14,000 probabilities
  • Expert consulted to make net.
    • 8 hours to determine variables.
    • 35 hours for net topology.
    • 40 hours for probability table values.
  • Apparently, the experts found it quite easy to invent the causal links and probabilities.
  • Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.

Bayes Net Info

  • GUI Packages:
    • Genie -- Free
    • Netica -- $$
    • Hugin -- $$
  • Non-GUI Packages:
    • All of the above have APIs
    • BNT for MATLAB
    • AUTON code (learning extremely large networks of tens of thousands of nodes)

Bayes Nets and Machine Learning

Machine Learning Tasks

  • Classifier: data point x → P(C | x)
  • Anomaly Detector: data point x → P(x)
  • Inference Engine: evidence e1 → P(e2 | e1) for missing variables e2

What is an Anomaly?

  • An irregularity that cannot be explained by simple domain models and knowledge
  • Anomaly detection only needs to learn from examples of “normal” system behavior.
      • Classification, on the other hand, would need examples labeled “normal” and “not-normal”

Anomaly Detection in Practice

  • Monitoring computer networks for attacks.
  • Monitoring population-wide health data for outbreaks or attacks.
  • Looking for suspicious activity in bank transactions
  • Detecting unusual eBay selling/buying behavior.

JPT Anomaly Detector

  • Suppose we have the following model:
    • P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
  • We’re trying to detect anomalous cars.
  • If the next example we see is a particular ⟨Mpg, Horse⟩ pair, how anomalous is it?

JPT Anomaly Detector

  • P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
  • How likely is a given ⟨Mpg, Horse⟩ pair?
  • Could not be easier! Just look up the entry in the JPT!
  • Smaller numbers are more anomalous in that the model is more surprised to see them.

Bayes Net Anomaly Detector

  • How likely is a given ⟨Mpg, Horse⟩ pair under the Bayes net?
  • Network: Mpg → Horse
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79

Bayes Net Anomaly Detector

  • How likely is a given ⟨Mpg, Horse⟩ pair?
  • Network: Mpg → Horse
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.90, P(low|bad) = 0.21, P(high|good) = 0.10, P(high|bad) = 0.79
  • Again, trivial! We need to do one tiny lookup for each variable in the network!
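
A sketch of that procedure (not from the slides): one CPT lookup per variable, multiplied together, with smaller products flagged as more anomalous. The dictionaries mirror the CPTs above; the function names are made up.

```python
# Anomaly scoring with the two-node net: one tiny CPT lookup per variable.
P_Mpg = {"good": 0.4, "bad": 0.6}
P_Horse_given_Mpg = {("low", "good"): 0.90, ("high", "good"): 0.10,
                     ("low", "bad"): 0.21, ("high", "bad"): 0.79}

def p_joint(mpg, horse):
    return P_Mpg[mpg] * P_Horse_given_Mpg[(horse, mpg)]

cars = [("good", "low"), ("good", "high"), ("bad", "low"), ("bad", "high")]
for car in sorted(cars, key=lambda c: p_joint(*c)):
    print(car, p_joint(*car))   # ('good', 'high') scores lowest (0.04): the most anomalous car
```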

Machine Learning Tasks

  • Classifier: data point x → P(C | x)
  • Anomaly Detector: data point x → P(x)
  • Inference Engine: evidence e1 → P(E2 | e1) for missing variables E2

Bayes Classifiers


Bayes Classifiers in 1 Slide

  • Bayes classifiers just do inference.
  • That’s it.
  • The “algorithm”
    • Learn P(class,X)
    • For a given input x, infer P(class|x)
    • Choose the class with the highest probability

JPT Bayes Classifier

  • Suppose we have the following model:
    • P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
  • We’re trying to classify cars as Mpg = “good” or “bad”
  • If the next example we see is Horse = “low”, how do we classify it?

JPT Bayes Classifier

  • P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
  • How do we classify Horse = “low”?
  • P(good | low) = 0.36 / 0.48 = 0.75, so we classify the example as “good”
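
A minimal sketch of this JPT classifier; classify is a hypothetical helper that normalizes the matching joint entries:

```python
# JPT Bayes classifier for Mpg given an observed Horse value.
jpt = {("good", "low"): 0.36, ("good", "high"): 0.04,
       ("bad", "low"): 0.12, ("bad", "high"): 0.48}

def classify(horse):
    """Return the Mpg value with the highest P(Mpg | Horse = horse)."""
    p_horse = sum(p for (m, h), p in jpt.items() if h == horse)
    posterior = {m: jpt[(m, horse)] / p_horse for m in ("good", "bad")}
    return max(posterior, key=posterior.get), posterior

print(classify("low"))   # ('good', {'good': 0.75, 'bad': 0.25})
```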

Bayes Net Classifier

  • Network: Mpg → Horse → Accel
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  • P(Accel|Horse): P(slow|low) = 0.95, P(slow|high) = 0.11, P(fast|low) = 0.05, P(fast|high) = 0.89
  • We’re trying to classify cars as Mpg = “good” or “bad”
  • If the next example we see is ⟨Horse = low, Accel = fast⟩, how do we classify it?

Bayes Net Bayes Classifier

  • Suppose we get a ⟨Horse = low, Accel = fast⟩ example?
  • Network: Mpg → Horse → Accel
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  • P(Accel|Horse): P(slow|low) = 0.95, P(slow|high) = 0.11, P(fast|low) = 0.05, P(fast|high) = 0.89
  • P(good | low, fast) = P(good, low, fast) / P(low, fast) ≈ 0.74
  • Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

Bayes Net Bayes Classifier

  • Network: Mpg → Horse → Accel
  • P(Mpg): P(good) = 0.4, P(bad) = 0.6
  • P(Horse|Mpg): P(low|good) = 0.89, P(low|bad) = 0.21, P(high|good) = 0.11, P(high|bad) = 0.79
  • P(Accel|Horse): P(slow|low) = 0.95, P(slow|high) = 0.11, P(fast|low) = 0.05, P(fast|high) = 0.89
  • P(good | low, fast) ≈ 0.75, so we classify the example as “good”.
  • …but that seems somehow familiar…
  • Wasn’t that the same answer as P(Mpg=good | Horse=low)?
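
A sketch of the same inference in code, using the CPTs above; with these rounded CPT entries the posterior comes out near 0.74, consistent with the slide's note about rounding:

```python
# Bayes-net classification of the <Horse = low, Accel = fast> example.
P_Mpg = {"good": 0.4, "bad": 0.6}
P_Horse_given_Mpg = {("low", "good"): 0.89, ("high", "good"): 0.11,
                     ("low", "bad"): 0.21, ("high", "bad"): 0.79}
P_Accel_given_Horse = {("slow", "low"): 0.95, ("fast", "low"): 0.05,
                       ("slow", "high"): 0.11, ("fast", "high"): 0.89}

def posterior_mpg(horse, accel):
    # P(Mpg, horse, accel) = P(Mpg) P(horse | Mpg) P(accel | horse), then normalize over Mpg.
    joint = {m: P_Mpg[m] * P_Horse_given_Mpg[(horse, m)] * P_Accel_given_Horse[(accel, horse)]
             for m in P_Mpg}
    z = sum(joint.values())
    return {m: p / z for m, p in joint.items()}

print(posterior_mpg("low", "fast"))   # {'good': ~0.74, 'bad': ~0.26}, so classify as "good"
```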

Bayes Classifiers

  • OK, so classification can be posed as inference
  • In fact, virtually all machine learning tasks are a form of inference
      • Anomaly detection: P(x)
      • Classification: P(Class | x)
      • Regression: P(Y | x)
      • Model Learning: P(Model | dataset)
      • Feature Selection: P(Model | dataset)

The Naïve Bayes Classifier

  • ASSUMPTION: all the attributes are conditionally independent given the class variable.

The Naïve Bayes Advantage

  • The full joint needs at least 256 parameters! You better have the data to support them…
  • The Naïve Bayes net needs a mere 25 parameters! (The CPTs are tiny because the attribute nodes only have one parent.)

What is the Probability Function of the Naïve Bayes?

  • P(Mpg, Cylinders, Weight, Maker, …) = P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) …
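
A minimal naïve Bayes classifier sketch built on that factorization; the CPT numbers and attribute names here are made up for illustration, not taken from the MPG data:

```python
from math import log

# Made-up class prior and per-attribute CPTs, P(attribute | class).
P_class = {"good": 0.4, "bad": 0.6}
P_attr_given_class = {
    "Cylinders": {("few", "good"): 0.8, ("many", "good"): 0.2,
                  ("few", "bad"): 0.3, ("many", "bad"): 0.7},
    "Weight": {("light", "good"): 0.9, ("heavy", "good"): 0.1,
               ("light", "bad"): 0.2, ("heavy", "bad"): 0.8},
}

def classify(example):
    """argmax over classes of log P(class) + sum of log P(attribute value | class)."""
    scores = {}
    for c, prior in P_class.items():
        s = log(prior)
        for attr, value in example.items():
            s += log(P_attr_given_class[attr][(value, c)])
        scores[c] = s
    return max(scores, key=scores.get)

print(classify({"Cylinders": "few", "Weight": "light"}))   # 'good'
```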

What is the Probability Function of the Naïve Bayes?

  • In general: P(Class, X1, X2, …, Xn) = P(Class) P(X1|Class) P(X2|Class) … P(Xn|Class)

Bayes Classifier Results: “MPG”: 392 records

  • The Classifier learned by “Naive BC”

Bayes Classifier Results: “MPG”: 40 records

More Facts About Bayes Classifiers

  • Many other density estimators can be slotted in
  • Density estimation can be performed with real-valued inputs
  • Bayes Classifiers can be built with real-valued inputs
  • Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on
  • Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help.
  • Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

Summary

  • Axioms of Probability
  • Bayes nets are created by
      • chain rule
      • conditional independence
  • Bayes Nets can do
      • Inference
      • Anomaly Detection
      • Classification

