# Machine Learning and the Axioms of Probability


## Machine Learning and the Axioms of Probability

• Brigham S. Anderson
• www.cs.cmu.edu/~brigham
• brigham@cmu.edu
• School of Computer Science
• Carnegie Mellon University

## Probability

• The world is a very uncertain place
• 30 years of Artificial Intelligence and Database research danced around this fact
• And then a few AI researchers decided to use some ideas from the eighteenth century

## What we’re going to do

• We will review the fundamentals of probability.
• It’s really going to be worth it
• You’ll see examples of probabilistic analytics in action:
• Inference,
• Anomaly detection, and
• Bayes Classifiers

## Discrete Random Variables

• A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.
• Examples
• A = The US president in 2023 will be male
• A = You wake up tomorrow with a headache
• A = You have influenza

## Probabilities

• We write P(A) as “the probability that A is true”
• We could at this point spend 2 hours on the philosophy of this.
• We’ll spend slightly less...

## Sample Space

• Definition 1.
• The set, S, of all possible outcomes of a particular experiment is called the sample space for the experiment
• The elements of the sample space are called outcomes.

## Sample Spaces

• Sample space of a coin flip:
• S = {H, T}

## Sample Spaces

• Sample space of a die roll:
• S = {1, 2, 3, 4, 5, 6}

## Sample Spaces

• Sample space of three die rolls?
• S = {111, 112, 113, …, 664, 665, 666}

## Sample Spaces

• Sample space of a single draw from a deck of cards:
• S = {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh}

## So Far…

| Definition | Example |
| --- | --- |
| The sample space is the set of all possible worlds. | {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh} |
| An outcome is an element of the sample space. | 2c |

## Events

• Definition 2.
• An event is any subset of S (including S itself)

## Events

• Event: “Jack”
• Sample Space of card draw
• The Sample Space is the set of all outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes

## Events

• Event: “Hearts”
• Sample Space of card draw
• The Sample Space is the set of all outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes

## Events

• Event: “Red and Face”
• Sample Space of card draw
• The Sample Space is the set of all outcomes.
• An Outcome is a possible world.
• An Event is a set of outcomes

## Definitions

| Definition | Example |
| --- | --- |
| The sample space is the set of all possible worlds. | {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh} |
| An outcome is a single point in the sample space. | 2c |
| An event is a set of outcomes from the sample space. | {2h, 2c, 2s, 2d} |

## Events

• Definition 3.
• Two events A and B are mutually exclusive if A ∩ B = Ø.
• Definition 4.
• If A1, A2, … are mutually exclusive and A1 ∪ A2 ∪ … = S, then the collection A1, A2, … forms a partition of S.
• Example: the events “clubs”, “diamonds”, “hearts”, and “spades” partition the sample space of a card draw.

## Probability

• Definition 5.
• Given a sample space S, a probability function is a function that maps each event in S to a real number, and satisfies
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, …, we have
• P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …
• (This definition of the domain of this function is not 100% sufficient, but it’s close enough for our purposes… I’m sparing you Borel Fields.)

## Definitions

| Definition | Example |
| --- | --- |
| The sample space is the set of all possible worlds. | {As, Ac, Ah, Ad, 2s, 2c, 2h, …, Ks, Kc, Kd, Kh} |
| An outcome is a single point in the sample space. | 4c |
| An event is a set of one or more outcomes. | Card is “Red” |
| P(E) maps event E to a real number and satisfies the axioms of probability. | P(Red) = 0.50, P(Black) = 0.50 |

## Misconception

• Misconception: the relative area of the events (e.g., A and ~A in a Venn diagram) determines their probability.
• …in a Venn diagram it does, but not in general.
• However, the “area equals probability” rule is guaranteed to result in axiom-satisfying probability functions.
• We often assume, for example, that the probability of “heads” is equal to “tails” in absence of other information…
• But this is totally outside the axioms!

## Creating a Valid P()

• One convenient way to create an axiom-satisfying probability function:
• Assign a probability to each outcome in S
• Make sure they sum to one
• Declare that P(A) equals the sum of the probabilities of the outcomes in event A
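
A minimal sketch of this recipe in Python (the coin-flip outcomes and their probabilities are just an illustrative choice):

```python
# Recipe: assign a probability to each outcome, check they sum to one,
# and define P(A) as the sum over the outcomes in event A.
outcome_probs = {"H": 0.5, "T": 0.5}   # sample space of a coin flip
assert abs(sum(outcome_probs.values()) - 1.0) < 1e-9

def P(event):
    """Probability of an event, given as a set of outcomes."""
    return sum(outcome_probs[o] for o in event)

print(P({"H"}))        # 0.5
print(P({"H", "T"}))   # P(S) = 1.0, as the axioms require
```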

## Everyday Example

• Assume you are a doctor.
• This is the sample space of “patients you might see on any given day”.
• Example outcomes:
• Non-smoker, female, diabetic, headache, good insurance, etc…
• Smoker, male, herniated disk, back pain, mildly schizophrenic, delinquent medical bills, etc…

## Everyday Example

• Number of elements in the “patient space”:
• 100 jillion
• Are these patients equally likely to occur?
• Again, generally not. Let’s assume for the moment that they are, though.
• …which roughly means “area equals probability”

## Everyday Example

• Event F: Patient has Flu
• Size of set “F”: 2 jillion (exactly 2 jillion of the points in the sample space have flu)
• Size of “patient space”: 100 jillion
• P_patientSpace(F) = 2 jillion / 100 jillion = 0.02

## Everyday Example

• P_patientSpace(F) = 0.02
• From now on, the subscript on P() will be omitted…

## These Axioms are Not to be Trifled With

• There have been other attempts to devise methodologies for uncertainty:
• Fuzzy Logic
• Three-valued logic
• Dempster-Shafer
• Non-monotonic reasoning
• But the axioms of probability are the only system with this property:
• If you gamble using them you can’t be unfairly exploited by an opponent using some other system [de Finetti 1931]

## Theorems from the Axioms

• Axioms
• P(A) ≥ 0 for any event A in S
• P(S) = 1
• For any number of mutually exclusive events A1, A2, A3, …, we have
• P(A1 ∪ A2 ∪ A3 ∪ …) = P(A1) + P(A2) + P(A3) + …
• Theorem.
• If P is a probability function and A is an event in S, then P(~A) = 1 – P(A)
• Proof:
• (1) Since A and ~A partition S, P(A ∪ ~A) = P(S) = 1
• (2) Since A and ~A are disjoint, P(A ∪ ~A) = P(A) + P(~A)
• Combining (1) and (2) gives the result.

## Multivalued Random Variables

• Suppose A can take on more than 2 values.
• A is a random variable with arity k if it can take on exactly one value out of {A1, A2, …, Ak}, and
• The events {A=A1}, {A=A2}, …, {A=Ak} partition S, so P(A=A1) + P(A=A2) + … + P(A=Ak) = 1.

## Elementary Probability in Pictures

• P(~A) + P(A) = 1

## Elementary Probability in Pictures

• P(B) = P(B, A) + P(B, ~A)
• More generally, if A1, A2, A3 partition the sample space, then P(A1) + P(A2) + P(A3) = 1…
• …and P(B) = P(B, A1) + P(B, A2) + P(B, A3). Useful!

## Conditional Probability

• Assume once more that you are a doctor.
• Again, this is the sample space of “patients you might see on any given day”.

• Event F: Flu. P(F) = 0.02
• Event H: Headache. P(H) = 0.10

## Conditional Probability

• P(F) = 0.02
• P(H) = 0.10
• …we still need to specify the interaction between flu and headache…
• Define P(H|F) = fraction of F’s outcomes which are also in H

## Conditional Probability

• F = “flu”
• P(F) = 0.02
• P(H) = 0.10
• P(H|F) = 0.50
• (Venn-diagram regions: P(H, F) = 0.01, P(F, ~H) = 0.01, P(H, ~F) = 0.09, P(~H, ~F) = 0.89.)

## Conditional Probability

• F = “flu”, H = “headache”
• P(H|F) = fraction of flu worlds in which the patient has a headache
• = (# worlds with flu and headache) / (# worlds with flu)
• = (size of “H and F” region) / (size of “F” region)
• = P(H, F) / P(F)

## Conditional Probability

• Definition.
• If A and B are events in S, and P(B) > 0, then the conditional probability of A given B, written P(A|B), is P(A|B) = P(A, B) / P(B).
• The Chain Rule
• A simple rearrangement of the above equation yields P(A, B) = P(A|B) P(B).
• Main Bayes Net concept!

## Probabilistic Inference

• H = “Have a headache”
• F = “Coming down with Flu”
• P(H) = 0.10
• P(F) = 0.02
• P(H|F) = 0.50
• One day you wake up with a headache. You think: “Drat! 50% of flus are associated with headaches so I must have a 50-50 chance of coming down with flu”
• Is this reasoning good?

## Probabilistic Inference

• H = “Have a headache”
• F = “Coming down with Flu”
• P(H) = 0.10
• P(F) = 0.02
• P(H|F) = 0.50
• No: P(F|H) = P(F, H) / P(H) = P(H|F) P(F) / P(H) = (0.50 × 0.02) / 0.10 = 0.10, so a headache gives only a 10% chance of flu.

## What we just did…

• P(B|A) = P(A, B) / P(A) = P(A|B) P(B) / P(A)
• This is Bayes Rule
• Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
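
A quick sketch applying Bayes rule to the flu/headache numbers from the preceding slides:

```python
# Bayes rule with the numbers used in the flu example above.
p_F = 0.02          # P(flu)
p_H = 0.10          # P(headache)
p_H_given_F = 0.50  # P(headache | flu)

# P(F | H) = P(H | F) P(F) / P(H)
p_F_given_H = p_H_given_F * p_F / p_H
print(p_F_given_H)  # 0.1 -- a 10% chance of flu, not 50-50
```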

## Independence

• Definition.
• Two events, A and B, are statistically independent if P(A, B) = P(A) P(B),
• which (when P(B) > 0) is equivalent to P(A|B) = P(A).
• Important for Bayes Nets

## Representing P(A,B,C)

• How can we represent the function P(A)?
• P(A,B)?
• P(A,B,C)?

## The Joint Probability Table

• Recipe for making a joint distribution of M variables:
• Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).
• For each combination of values, say how probable it is.
• If you subscribe to the axioms of probability, those numbers must sum to 1.
| A | B | C | Prob |
|---|---|---|---|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |
• Example: P(A, B, C) (the same eight probabilities can be drawn as regions of a Venn diagram over A, B, and C whose areas equal the table entries)

## Using the Joint

• Once you have the JPT you can ask for the probability of any logical expression
• …what is P(Poor,Male)?

## Using the Joint

• P(Poor, Male) = 0.4654
• …what is P(Poor)?

## Using the Joint

• P(Poor) = 0.7604
• …what is P(Male | Poor)?

## Inference with the Joint

• P(Male | Poor) = 0.4654 / 0.7604 = 0.612
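
A sketch of this kind of JPT query in Python. The two-variable table below is a made-up stand-in: only P(Poor, Male) = 0.4654 and P(Poor) = 0.7604 come from the slides; the remaining entries are invented so the table sums to one.

```python
# JPT-style inference: sum the probabilities of all rows matching the query.
joint = {
    # (poor, male): probability
    (True,  True):  0.4654,
    (True,  False): 0.2950,   # invented
    (False, True):  0.1400,   # invented
    (False, False): 0.0996,   # invented
}

def P(pred):
    """Probability of any logical expression, given as a predicate on rows."""
    return sum(p for row, p in joint.items() if pred(*row))

p_poor_and_male = P(lambda poor, male: poor and male)   # 0.4654
p_poor          = P(lambda poor, male: poor)            # 0.7604
print(p_poor_and_male / p_poor)                         # P(Male | Poor) ≈ 0.612
```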

## Inference is a big deal

• I’ve got this evidence. What’s the chance that this conclusion is true?
• I’ve got a sore neck: how likely am I to have meningitis?
• I see my lights are out and it’s 9pm. What’s the chance my spouse is already asleep?
• There’s a thriving set of industries growing based around Bayesian Inference. Highlights are: Medicine, Pharma, Help Desk Support, Engine Fault Diagnosis

## Where do Joint Distributions come from?

• Idea One: Expert Humans
• Idea Two: Simpler probabilistic facts and some algebra
• Example: Suppose you knew
• P(A) = 0.5
• P(B|A) = 0.2
• P(B|~A) = 0.1
• P(C|A,B) = 0.1
• P(C|A,~B) = 0.8
• P(C|~A,B) = 0.3
• P(C|~A,~B) = 0.1
• Then you can automatically compute the JPT using the chain rule
• P(A,B,C) = P(A) P(B|A) P(C|A,B)
• Bayes Nets are a systematic way to do this.
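
A sketch of that computation: the eight entries of P(A, B, C) fall out of the seven simpler facts above via the chain rule.

```python
from itertools import product

# P(A, B, C) = P(A) * P(B | A) * P(C | A, B), using the numbers above.
p_A = {True: 0.5, False: 0.5}
p_B_given_A = {True: 0.2, False: 0.1}                    # P(B=1 | A)
p_C_given_AB = {(True, True): 0.1, (True, False): 0.8,   # P(C=1 | A, B)
                (False, True): 0.3, (False, False): 0.1}

jpt = {}
for a, b, c in product([True, False], repeat=3):
    p_b = p_B_given_A[a] if b else 1 - p_B_given_A[a]
    p_c = p_C_given_AB[(a, b)] if c else 1 - p_C_given_AB[(a, b)]
    jpt[(a, b, c)] = p_A[a] * p_b * p_c

assert abs(sum(jpt.values()) - 1.0) < 1e-9   # a valid probability function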

## Where do Joint Distributions come from?

• Idea Three: Learn them from data!
• Prepare to witness an impressive learning algorithm….

## Learning a JPT

• Make a table listing all combinations of values of your variables, with the probabilities unspecified:

| A | B | C | Prob |
|---|---|---|---|
| 0 | 0 | 0 | ? |
| 0 | 0 | 1 | ? |
| 0 | 1 | 0 | ? |
| 0 | 1 | 1 | ? |
| 1 | 0 | 0 | ? |
| 1 | 0 | 1 | ? |
| 1 | 1 | 0 | ? |
| 1 | 1 | 1 | ? |

• Then fill in each row with the fraction of all records that match it. For example, the 0.25 below is the fraction of all records in which A and B are true but C is false:

| A | B | C | Prob |
|---|---|---|---|
| 0 | 0 | 0 | 0.30 |
| 0 | 0 | 1 | 0.05 |
| 0 | 1 | 0 | 0.10 |
| 0 | 1 | 1 | 0.05 |
| 1 | 0 | 0 | 0.05 |
| 1 | 0 | 1 | 0.10 |
| 1 | 1 | 0 | 0.25 |
| 1 | 1 | 1 | 0.10 |
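
A sketch of this learning algorithm; the handful of records is made up for illustration:

```python
from collections import Counter
from itertools import product

# Learning a JPT: each row's probability is the fraction of records matching it.
records = [(1, 1, 0), (0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 0, 1), (0, 0, 0)]

counts = Counter(records)
jpt = {row: counts[row] / len(records)
       for row in product([0, 1], repeat=3)}   # unseen rows get probability 0

print(jpt[(1, 1, 0)])   # fraction of records with A=1, B=1, C=0 (here 1/3)
```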

## Example of Learning a JPT

• This JPT was obtained by learning from three attributes in the UCI “Adult” Census Database [Kohavi 1995]

## Where are we?

• We have recalled the fundamentals of probability
• We have become content with what JPTs are and how to use them
• And we even know how to learn JPTs from data.

## Density Estimation

• Our Joint Probability Table (JPT) learner is our first example of something called Density Estimation
• A Density Estimator learns a mapping from a set of attributes to a Probability
• Input Attributes → Density Estimator → Probability

## Evaluating a density estimator

• Given a record x, a density estimator M can tell you how likely the record is: P(x | M)
• Given a dataset with R records, a density estimator can tell you how likely the dataset is: P(dataset | M) = P(x1 ∧ x2 ∧ … ∧ xR | M) = P(x1 | M) × P(x2 | M) × … × P(xR | M)
• (Under the assumption that all records were independently generated from the probability function)

## A small dataset: Miles Per Gallon

• From the UCI repository (thanks to Ross Quinlan)

## A small dataset: Miles Per Gallon

• 192 Training Set Records

## A small dataset: Miles Per Gallon

• 192 Training Set Records

## Log Probabilities

• Since probabilities of datasets get so small we usually use log probabilities: log P(dataset | M) = log P(x1 | M) + log P(x2 | M) + … + log P(xR | M)
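
A sketch of scoring a dataset this way; the tiny density estimator here is a made-up lookup table, but any P(x | M), such as a learned JPT, could be slotted in:

```python
import math

# P(dataset | M) = product of P(x_k | M) over the records, so we sum logs
# instead of multiplying probabilities (which underflow quickly).
def log_likelihood(dataset, p_hat):
    return sum(math.log(p_hat(x)) for x in dataset)

# Example with a toy estimator over two binary attributes.
p_hat = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.1}.get
print(log_likelihood([(0, 0), (1, 1), (0, 1)], p_hat))   # about -4.61
```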

## A small dataset: Miles Per Gallon

• 192 Training Set Records

## Summary: The Good News

• The JPT allows us to learn P(X) from data.
• Can do inference: P(E1|E2)
• Automatic Doctor, Recommender, etc
• Can do anomaly detection
• spot suspicious / incorrect records
• (e.g., credit card fraud)
• Can do Bayesian classification
• Predict the class of a record
• (e.g., predict cancerous / not cancerous)

• Density estimation with JPTs is trivial, mindless and dangerous

## Using a test set

• An independent test set with 196 cars has a much worse log likelihood than it had on the training set
• (actually it’s a billion quintillion quintillion quintillion quintillion times less likely)
• ….Density estimators can overfit. And the JPT estimator is the overfittiest of them all!

## Overfitting Density Estimators

• If a test-set record contains a combination of values that never occurred in the training set, the JPT assigns it probability zero: there are certain combinations that we learn are “impossible”

## Using a test set

• The only reason that our test set didn’t score -infinity is that Andrew’s code is hard-wired to always predict a probability of at least one in 10^20
• We need Density Estimators that are less prone to overfitting

## Is there a better way?

• The problem with the JPT is that it just mirrors the training data.
• In fact, it is just another way of storing the data: we could reconstruct the original dataset perfectly from it!
• We need to represent the probability function with fewer parameters…

## Bayes Nets

• What are they?
• Bayesian nets are a framework for representing and analyzing models involving uncertainty
• What are they used for?
• Intelligent decision aids, data fusion, 3-E feature recognition, intelligent diagnostic aids, automated free text understanding, data mining
• How are they different from other knowledge representation and probabilistic analysis tools?
• Uncertainty is handled in a mathematically rigorous yet efficient and simple way

## Bayes Net Concepts

• Chain Rule
• P(A,B) = P(A) P(B|A)
• Conditional Independence
• P(A|B,C) = P(A|B)

## A Simple Bayes Net

• Let’s assume that we already have P(Mpg,Horse)
• How would you rewrite this using the Chain rule?
• P(Mpg, Horse) =
• P(good, low) = 0.36
• P(good, high) = 0.04
• P(bad, low) = 0.12
• P(bad, high) = 0.48

## Review: Chain Rule

• P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
• P(Mpg, Horse): P(good, low) = 0.36, P(good, high) = 0.04, P(bad, low) = 0.12, P(bad, high) = 0.48
• P(Mpg): P(good) = 0.4, P(bad) = 0.6
• P(Horse | Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21, P(high|bad) = 0.79

## Review: Chain Rule

• P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg):
• P(good, low) = P(good) * P(low|good) = 0.4 * 0.89 ≈ 0.36
• P(good, high) = P(good) * P(high|good) = 0.4 * 0.11 ≈ 0.04
• P(bad, low) = P(bad) * P(low|bad) = 0.6 * 0.21 ≈ 0.12
• P(bad, high) = P(bad) * P(high|bad) = 0.6 * 0.79 ≈ 0.48

## How to Make a Bayes Net

• P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
• Net structure: Mpg → Horse

## How to Make a Bayes Net

• P(Mpg, Horse) = P(Mpg) * P(Horse | Mpg)
• Net structure: Mpg → Horse
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.90, P(high|good) = 0.10, P(low|bad) = 0.21

## How to Make a Bayes Net

• Net structure: Mpg → Horse
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.90, P(high|good) = 0.10, P(low|bad) = 0.21
• Each node is a probability function
• Each arc denotes conditional dependence

## How to Make a Bayes Net

• So, what have we accomplished thus far?
• Nothing;
• we’ve just “Bayes Net-ified” the P(Mpg, Horse) JPT using the Chain rule.
• …the real excitement starts when we wield conditional independence
• Mpg → Horse, with P(Mpg) and P(Horse|Mpg) at the nodes

## How to Make a Bayes Net

• Before we continue, we need a worthier opponent than puny P(Mpg, Horse)…
• We’ll use P(Mpg, Horse, Accel):
• P(Mpg, Horse, Accel):
• P(good, low, slow) = 0.37
• P(good, low, fast) = 0.01
• P(good, high, slow) = 0.02
• P(good, high, fast) = 0.00
• P(bad, low, slow) = 0.10
• P(bad, low, fast) = 0.12
• …
• * Note: I made these up…

## How to Make a Bayes Net

• Step 1: Rewrite joint using the Chain rule.
• P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)
• Note:
• Obviously, we could have written this 3!=6 different ways…
• P(M, H, A) = P(M) * P(H|M) * P(A|M,H)
• = P(M) * P(A|M) * P(H|M,A)
• = P(H) * P(M|H) * P(A|H,M)
• = P(H) * P(A|H) * P(M|H,A)
• = …
• = …

## How to Make a Bayes Net

• Net structure: Mpg → Horse → Accel, plus an arc Mpg → Accel
• Step 1: Rewrite joint using the Chain rule.
• P(Mpg, Horse, Accel) = P(Mpg) P(Horse | Mpg) P(Accel | Mpg, Horse)

## How to Make a Bayes Net

• Net structure: Mpg → Horse → Accel, plus an arc Mpg → Accel
• Node probability functions: P(Mpg), P(Horse|Mpg), P(Accel|Mpg,Horse)

## How to Make a Bayes Net

• Net structure: Mpg → Horse → Accel, plus an arc Mpg → Accel
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.90, P(high|good) = 0.10, P(low|bad) = 0.21
• P(Accel|Mpg,Horse):
• P(slow|good, low) = 0.97, P(fast|good, low) = 0.03
• P(slow|good,high) = 0.15, P(fast|good,high) = 0.85
• P(slow| bad, low) = 0.90, P(fast| bad, low) = 0.10
• * Note: I made these up too…

## How to Make a Bayes Net

• Net structure: Mpg → Horse → Accel, plus an arc Mpg → Accel
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21
• P(Accel|Mpg,Horse): P(slow|good, low) = 0.97, P(slow|good,high) = 0.15, P(slow| bad, low) = 0.90, P(fast|good, low) = 0.03, P(fast|good,high) = 0.85, P(fast| bad, low) = 0.10
• A Miracle Occurs!
• You are told by God (or another domain expert) that Accel is independent of Mpg given Horse!
• i.e., P(Accel | Mpg, Horse) = P(Accel | Horse)

## How to Make a Bayes Net

• Net structure: Mpg → Horse → Accel
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21
• P(Accel|Horse): P(slow| low) = 0.22, P(fast| low) = 0.78, P(slow|high) = 0.64, P(fast|high) = 0.36

## How to Make a Bayes Net

• Net structure: Mpg → Horse → Accel
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21
• P(Accel|Horse): P(slow| low) = 0.22, P(fast| low) = 0.78, P(slow|high) = 0.64, P(fast|high) = 0.36
• Thank you, domain expert! Now I only need to learn 5 parameters instead of 7 from my data! My parameter estimates will be more accurate as a result!
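
A sketch of this net as code: one CPT per node, and any joint entry is the product of one lookup per node. The numbers are from the slide above; the CPT rows not shown there (P(bad), P(high|bad)) are filled in by complements.

```python
# Mpg -> Horse -> Accel, with one CPT per node.
p_mpg   = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.89, ("high", "good"): 0.11,   # P(Horse | Mpg)
           ("low", "bad"):  0.21, ("high", "bad"):  0.79}
p_accel = {("slow", "low"):  0.22, ("fast", "low"):  0.78,  # P(Accel | Horse)
           ("slow", "high"): 0.64, ("fast", "high"): 0.36}

def p_joint(mpg, horse, accel):
    """P(Mpg, Horse, Accel) = P(Mpg) P(Horse|Mpg) P(Accel|Horse)."""
    return p_mpg[mpg] * p_horse[(horse, mpg)] * p_accel[(accel, horse)]

print(p_joint("good", "low", "slow"))   # 0.4 * 0.89 * 0.22 ≈ 0.078
```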

## Independence

• “The Acceleration does not depend on the Mpg once I know the Horsepower.”
• This can be specified very simply: P(Accel | Mpg, Horse) = P(Accel | Horse)
• This is a powerful statement!
• It required extra domain knowledge, a different kind of knowledge than numerical probabilities: an understanding of causation.

## Bayes Nets Formalized

• A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.
• Each vertex in V contains the following information:
• A Conditional Probability Table (CPT) indicating how this variable’s probabilities depend on all possible combinations of parental values.

## Bayes Nets Summary

• Bayes nets are a factorization of the full JPT which uses the chain rule and conditional independence.
• They can do everything a JPT can do (like quick, cheap lookups of probabilities)

## The good news

• We can do inference.
• We can compute any conditional probability:
• P( Some variable | Some other variable values )

## The good news

• We can do inference.
• We can compute any conditional probability:
• P( Some variable | Some other variable values )
• Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables.
• How much work is the above computation?

• Doing inference “JPT-style” by enumerating all matching entries in the joint is expensive:
• Exponential in the number of variables.
• But perhaps there are faster ways of querying Bayes nets?
• In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time.
• So we’ve just got to program our computer to do those tricks too, right?
• General querying of Bayes nets is NP-complete.

## Case Study I

• Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).
• Diagnostic system for lymph-node diseases.
• 60 diseases and 100 symptoms and test-results.
• 14,000 probabilities
• Expert consulted to make net.
• 8 hours to determine variables.
• 35 hours for net topology.
• 40 hours for probability table values.
• Apparently, the experts found it quite easy to invent the causal links and probabilities.
• Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.

## Bayes Net Info

• GUI Packages:
• Genie -- Free
• Netica -- \$\$
• Hugin -- \$\$
• Non-GUI Packages:
• All of the above have APIs
• BNT for MATLAB
• AUTON code (learning extremely large networks of tens of thousands of nodes)

## Bayes Nets and Machine Learning

• Classifier: data point x → P(C | x)
• Anomaly Detector: data point x → P(x)
• Inference Engine: evidence e1 → P(e2 | e1) for missing variables e2

## What is an Anomaly?

• An irregularity that cannot be explained by simple domain models and knowledge
• Anomaly detection only needs to learn from examples of “normal” system behavior.
• Classification, on the other hand, would need examples labeled “normal” and “not-normal”

## Anomaly Detection in Practice

• Monitoring computer networks for attacks.
• Monitoring population-wide health data for outbreaks or attacks.
• Looking for suspicious activity in bank transactions
• Detecting unusual eBay selling/buying behavior.

## JPT Anomaly Detector

• Suppose we have the following model:
• P(Mpg, Horse) =
• P(good, low) = 0.36
• P(good, high) = 0.04
• P(bad, low) = 0.12
• P(bad, high) = 0.48
• We’re trying to detect anomalous cars.
• If the next example we see is …, how anomalous is it?

## JPT Anomaly Detector

• P(Mpg, Horse) =
• P(good, low) = 0.36
• P(good, high) = 0.04
• P(bad, low) = 0.12
• P(bad, high) = 0.48
• How likely is the example …?
• Could not be easier! Just look up the entry in the JPT!
• Smaller numbers are more anomalous in that the model is more surprised to see them.

## Bayes Net Anomaly Detector

• How likely is the example …?
• Net structure: Mpg → Horse
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.90, P(high|good) = 0.10, P(low|bad) = 0.21

## Bayes Net Anomaly Detector

• How likely is the example …?
• Net structure: Mpg → Horse
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.90, P(high|good) = 0.10, P(low|bad) = 0.21
• Again, trivial! We need to do one tiny lookup for each variable in the network!
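
A sketch of that scoring: one CPT lookup per variable, multiplied together. The query record is not shown on the slide, so the one below is just an illustration (and P(high|bad) is filled in by complement):

```python
p_mpg   = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.90, ("high", "good"): 0.10,   # P(Horse | Mpg)
           ("low", "bad"):  0.21, ("high", "bad"):  0.79}

def p_record(mpg, horse):
    # Smaller values are more anomalous: the model is more surprised by them.
    return p_mpg[mpg] * p_horse[(horse, mpg)]

print(p_record("good", "high"))   # 0.4 * 0.10 = 0.04
```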

• Classifier: data point x → P(C | x)
• Anomaly Detector: data point x → P(x)
• Inference Engine: evidence e1 → P(E2 | e1) for missing variables E2


## Bayes Classifiers in 1 Slide

• Bayes classifiers just do inference.
• That’s it.
• The “algorithm”
• Learn P(class,X)
• For a given input x, infer P(class|x)
• Choose the class with the highest probability
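
A sketch of that three-step algorithm, using a learned joint over (class, attribute value); the example joint is the Mpg/Horse table used elsewhere in these slides.

```python
def classify(joint, x, classes):
    """Infer P(class | x) from a learned P(class, x) and pick the best class."""
    scores = {c: joint[(c, x)] for c in classes}     # P(class, x)
    total = sum(scores.values())                     # P(x)
    posteriors = {c: s / total for c, s in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors

joint = {("good", "low"): 0.36, ("good", "high"): 0.04,
         ("bad",  "low"): 0.12, ("bad",  "high"): 0.48}
print(classify(joint, "low", ["good", "bad"]))   # ('good', {'good': 0.75, 'bad': 0.25})
```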

## JPT Bayes Classifier

• Suppose we have the following model:
• P(Mpg, Horse) =
• P(good, low) = 0.36
• P(good, high) = 0.04
• P(bad, low) = 0.12
• P(bad, high) = 0.48
• We’re trying to classify cars as Mpg = “good” or “bad”
• If the next example we see is Horse = “low”, how do we classify it?

## JPT Bayes Classifier

• P(Mpg, Horse) =
• P(good, low) = 0.36
• P(good, high) = 0.04
• P(bad, low) = 0.12
• P(bad, high) = 0.48
• How do we classify Horse = “low”?
• P(good | low) = P(good, low) / P(low) = 0.36 / (0.36 + 0.12) = 0.75, so we classify the example as “good”.

## Bayes Net Classifier

• Net structure: Mpg → Horse → Accel
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21
• P(Accel|Horse): P(slow| low) = 0.95, P(slow|high) = 0.11, P(fast| low) = 0.05, P(fast|high) = 0.89
• We’re trying to classify cars as Mpg = “good” or “bad”
• If the next example we see is Horse = “low”, Accel = “fast”, how do we classify it?

## Bayes Net Bayes Classifier

• Suppose we get a Horse = “low”, Accel = “fast” example?
• Net structure: Mpg → Horse → Accel
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21
• P(Accel|Horse): P(slow| low) = 0.95, P(slow|high) = 0.11, P(fast| low) = 0.05, P(fast|high) = 0.89
• P(good | low, fast) = P(good) P(low|good) P(fast|low) / [ P(good) P(low|good) P(fast|low) + P(bad) P(low|bad) P(fast|low) ] = 0.356 / 0.482 ≈ 0.74
• Note: this is not exactly 0.75 because I rounded some of the CPT numbers earlier…

## Bayes Net Bayes Classifier

• Net structure: Mpg → Horse → Accel
• P(Mpg): P(good) = 0.4
• P(Horse|Mpg): P(low|good) = 0.89, P(high|good) = 0.11, P(low|bad) = 0.21
• P(Accel|Horse): P(slow| low) = 0.95, P(slow|high) = 0.11, P(fast| low) = 0.05, P(fast|high) = 0.89
• The P(good | low, fast) = 0.75, so we classify the example as “good”.
• …but that seems somehow familiar…
• Wasn’t that the same answer as P(Mpg=good | Horse=low)?
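
A sketch of this inference by enumeration over the three-node net (CPT numbers from the slide above, with the unlisted rows filled in by complements). It confirms the answer (≈0.74, rounded to 0.75 above) and that it equals P(Mpg=good | Horse=low), since Accel is independent of Mpg given Horse.

```python
# Inference by enumeration on the Mpg -> Horse -> Accel net.
p_mpg   = {"good": 0.4, "bad": 0.6}
p_horse = {("low", "good"): 0.89, ("high", "good"): 0.11,
           ("low", "bad"):  0.21, ("high", "bad"):  0.79}
p_accel = {("slow", "low"):  0.95, ("fast", "low"):  0.05,
           ("slow", "high"): 0.11, ("fast", "high"): 0.89}

def joint(m, h, a):
    return p_mpg[m] * p_horse[(h, m)] * p_accel[(a, h)]

# P(good | low, fast): sum matching joint entries, then normalize.
num = joint("good", "low", "fast")
den = sum(joint(m, "low", "fast") for m in p_mpg)
print(num / den)                                       # ≈ 0.739

# P(good | low): the same answer, because the P(Accel | Horse) factor cancels.
num2 = sum(joint("good", "low", a) for a in ("slow", "fast"))
den2 = sum(joint(m, "low", a) for m in p_mpg for a in ("slow", "fast"))
print(num2 / den2)                                     # ≈ 0.739
```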

## Bayes Classifiers

• OK, so classification can be posed as inference
• In fact, virtually all machine learning tasks are a form of inference
• Anomaly detection: P(x)
• Classification: P(Class | x)
• Regression: P(Y | x)
• Model Learning: P(Model | dataset)
• Feature Selection: P(Model | dataset)

## The Naïve Bayes Classifier

• ASSUMPTION: all the attributes are conditionally independent given the class variable

• Without the assumption: at least 256 parameters! You had better have the data to support them…
• With the Naïve assumption: a mere 25 parameters! (The CPTs are tiny because the attribute nodes only have one parent.)

## What is the Probability Function of the Naïve Bayes?

• P(Mpg, Cylinders, Weight, Maker, …) = P(Mpg) P(Cylinders|Mpg) P(Weight|Mpg) P(Maker|Mpg) …
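
A minimal Naïve Bayes sketch for discrete attributes: learn P(class) and each P(attribute | class) by counting, then score a new record with exactly the product above. The toy records are made up, and there is no smoothing, so the zero-probability problem mentioned later will bite on unseen values.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records):
    """records: list of (class_label, attribute_tuple). Returns a scorer."""
    class_counts = Counter(c for c, _ in records)
    attr_counts = defaultdict(Counter)     # (attr_index, class) -> value counts
    for c, attrs in records:
        for i, v in enumerate(attrs):
            attr_counts[(i, c)][v] += 1

    def scores(attrs):
        out = {}
        for c, n_c in class_counts.items():
            s = n_c / len(records)                       # P(class)
            for i, v in enumerate(attrs):
                s *= attr_counts[(i, c)][v] / n_c        # P(attribute_i | class)
            out[c] = s                                   # proportional to P(class | attrs)
        return out
    return scores

# Toy (Mpg; Horse, Weight) records, invented for illustration.
scores = train_naive_bayes([("good", ("low", "light")), ("good", ("low", "light")),
                            ("bad",  ("high", "heavy")), ("bad",  ("low", "heavy"))])
s = scores(("low", "light"))
print(max(s, key=s.get))   # 'good'
```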

## Bayes Classifier Results: “MPG”: 392 records

• The Classifier learned by “Naive BC”

## More Facts About Bayes Classifiers

• Many other density estimators can be slotted in
• Density estimation can be performed with real-valued inputs
• Bayes Classifiers can be built with real-valued inputs
• Rather Technical Complaint: Bayes Classifiers don’t try to be maximally discriminative---they merely try to honestly model what’s going on
• Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words “Dirichlet Prior”) can help.
• Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!

## Summary

• Axioms of Probability
• Bayes nets are created by
• chain rule
• conditional independence
• Bayes Nets can do
• Inference
• Anomaly Detection
• Classification
• The Axioms Of Probability