Consider the domain of dealing five card poker hands from a standard deck of 52 cards, under the assumption the dealer is fair.
How many atomic events are there in the joint probability distribution? (I.e., how many five card hands are there?)
This is simply the number of ways to choose five things from 52:

C(52, 5) = 52! / (5! · 47!) = 2,598,960
What is the probability of each atomic event?
Each hand is equally likely, so:

P(hand) = 1 / 2,598,960 ≈ 3.85 × 10⁻⁷
What is the probability of being dealt a royal straight flush?
There are four royal straight flushes, so:

P(royal straight flush) = 4 / 2,598,960 = 1 / 649,740
What is the probability of being dealt four of a kind?
Each four-of-a-kind hand is distinguished by two things: the rank there are four of, and the odd fifth card. There are 13 ranks to have four of, and for each such four of a kind there are 48 possibilities for the remaining card. So, the number of ways to get four of a kind is:

13 × 48 = 624

The probability then would be:

P(four of a kind) = 624 / 2,598,960 ≈ 0.00024
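These counts are easy to sanity-check programmatically; a short script (using Python's `math.comb`, not part of the original text):

```python
from math import comb

hands = comb(52, 5)            # ways to choose 5 cards from 52
assert hands == 2598960

p_hand = 1 / hands             # each hand is equally likely
p_royal = 4 / hands            # one royal straight flush per suit
four_kind = 13 * 48            # 13 ranks times 48 choices of odd card
p_four = four_kind / hands

print(hands, four_kind)        # 2598960 624
```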
Show that the statement:

P(a, b) = P(a) P(b)

is equivalent to either of the statements:

P(a | b) = P(a)    or    P(b | a) = P(b)
First, some notation: PA,B(a,b) denotes the joint probability of A and B, the probability that both A = a and B = b.
The initial statement:

P(a, b) = P(a) P(b)

is a statement that A and B are independent.
For conditional probability in general:

P(a | b) = P(a, b) / P(b)
This can be expanded into a form for the joint probability:

P(a, b) = P(a | b) P(b)
Likewise:

P(a, b) = P(b | a) P(a)
Combining these two forms (which is just Bayes' rule) gives:

P(a | b) P(b) = P(b | a) P(a)
Incorporating the initial assumption of independence:

P(a | b) = P(a) P(b) / P(b) = P(a)    and, symmetrically,    P(b | a) = P(b)
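Assuming the elided statements are the standard forms of independence, the equivalence in both directions can be written out explicitly (my reconstruction, not the original derivation):

```latex
% Assume P(a) > 0 and P(b) > 0 so the conditionals are defined.
% Product form implies the first statement:
P(a \mid b) = \frac{P(a,b)}{P(b)} = \frac{P(a)\,P(b)}{P(b)} = P(a)
% First statement implies the product form:
P(a,b) = P(a \mid b)\,P(b) = P(a)\,P(b)
% Symmetrically for the second statement:
P(b \mid a) = \frac{P(a,b)}{P(a)} = \frac{P(a)\,P(b)}{P(a)} = P(b)
```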
Suppose you are a witness to a nighttime hit-and-run accident involving a taxi in Athens. All taxis in Athens are either blue or green. You swear, under oath, that the taxi was blue. Extensive testing shows that, under dim lighting conditions, discrimination between blue and green is 75% reliable.
Is it possible to calculate the most likely color for the taxi?
From the question we are given probabilities about the events:

tb — the taxi was in fact blue
pb — the taxi was perceived to be blue

The information given is:

P(pb | tb) = 0.75 and P(¬pb | ¬tb) = 0.75
The desired quantity is the most likely color for the taxi: that is, whether P(tb | pb) is greater than 0.5.
It is worth noting two facts that follow from the reliability figure: P(pb | ¬tb) = 0.25 and P(¬pb | tb) = 0.25.
Consider the probability that the taxi was in fact blue given that it was perceived to be blue, using Bayes' rule (if this is less than 0.5, then the taxi was probably green):

P(tb | pb) = P(pb | tb) P(tb) / (P(pb | tb) P(tb) + P(pb | ¬tb) P(¬tb))
All of the conditional probabilities are known, but P(tb) isn't, so the equation can't be solved. This makes sense: say, for example, there are 10,000 taxis in Athens, 9,999 green and 1 blue. The probability that I mistook one of the green ones for blue is then much higher than the probability that I actually saw the one blue taxi.
How about if it is known that 9 out of 10 Athenian taxis are green?
This adds an additional known probability:

P(tb) = 0.1
With this information it is possible to solve the previous equation:

P(tb | pb) = (0.75 × 0.1) / (0.75 × 0.1 + 0.25 × 0.9) = 0.075 / 0.3 = 0.25
So, there is only a 0.25 probability that the taxi was blue given that I thought it was blue. I.e., it was probably green.
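The arithmetic can be checked with a few lines of Python (the 0.1 prior comes from the 9-in-10 figure above):

```python
# Bayes' rule for P(taxi blue | perceived blue).
p_tb = 0.1                      # prior: 1 in 10 Athenian taxis is blue
p_pb_tb = 0.75                  # witness reliability
p_pb_ntb = 0.25                 # chance of calling a green taxi blue

num = p_pb_tb * p_tb
posterior = num / (num + p_pb_ntb * (1 - p_tb))
print(posterior)                # ≈ 0.25: the taxi was probably green
```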
Two astronomers in different parts of the world make measurements M1 and M2 of the number of stars, N, for a particular region of the sky. Normally, there is a small probability, e, that there will be an error of up to one star in either direction. Each telescope can also be badly out of focus (events F1 and F2) with much smaller probability f, in which case the scientist will undercount by three or more stars. If N is less than 3, they can fail to detect any stars at all. Consider the following Bayesian networks:
Which of these Bayesian networks are correct (but not necessarily efficient) representations of the question?
ii is clearly correct because a change in either N or Fi will affect Mi. The statement of the question strongly suggests that at least one of the other two is an inefficient but correct representation.
i can be eliminated because in it Mi depends only on Fi. The count found by the first observer, M1, is certainly correlated with the count found by the second, M2, since both depend on N.
iii doesn't assert any conditional independences, which is correct, though not terribly efficient.
Which is the best network?
ii is the most succinct network, requiring the minimum amount of information to do computations.
Write out the conditional probability distribution for P(M1 | N), for the case where N ∈ {1,2,3} and M1 ∈ {0,1,2,3,4}. Each element should be expressed in terms of the parameters e and/or f.
If a telescope is out of focus it will undercount by at least three. Counting errors will cause an error of at most one, so it is not possible to undercount by two. Also it is not possible to overcount by more than one.
It is assumed that a counting error is equally likely in either direction, so each direction carries probability e/2:
M1 | P(M1 | N = 1) | P(M1 | N = 2) | P(M1 | N = 3) |
---|---|---|---|
0 | e/2 + f | f | f |
1 | 1 − e − f | e/2 | 0 |
2 | e/2 | 1 − e − f | e/2 |
3 | 0 | e/2 | 1 − e − f |
4 | 0 | 0 | e/2 |
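A quick way to check the table is to generate each column from the error model and confirm that it sums to 1; a sketch, assuming the e/2 split described above (the e and f values are illustrative, not from the problem):

```python
# Build the column P(M1 | N = n) of the CPT from the error model:
# probability e of a +/-1 miscount (e/2 per direction) and probability f
# of being badly out of focus, which reads 0 stars whenever N <= 3.
def cpt_column(n, e, f):
    col = {m: 0.0 for m in range(0, n + 2)}
    col[n] = 1 - e - f          # correct count
    col[n - 1] += e / 2         # undercount by one
    col[n + 1] += e / 2         # overcount by one
    col[0] += f                 # out of focus: no stars detected
    return col

for n in (1, 2, 3):
    col = cpt_column(n, e=0.05, f=0.002)
    assert abs(sum(col.values()) - 1.0) < 1e-12
```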
Suppose M1 = 1 and M2 = 3. What are the possible numbers of stars if there is no prior constraint on the value of N?
Since it is possible to undercount by an arbitrary number of stars, there is no maximum possible value. Both M1 and M2 could be underestimating by 100,000 stars for all that is specified in the problem. The question is the minimum number. It is only possible to overcount by 1, so the minimum set by M2 is 2. Note, though, that not every value above 2 is possible: N = 3 would require M1 to undercount by exactly 2, and N = 5 would require M2 to undercount by exactly 2, both of which are impossible. The possible values are N = 2, N = 4, and N ≥ 6.
What is the most likely number of stars given M1 = 1 and M2 = 3? If it is not possible to compute this, explain what additional information is needed and how it would affect the result.
The likelihood that a telescope is out of focus and a serious undercount is happening is constant at f. If this is the case for both telescopes, then any N at or above 6 (M2 + 3) has the same likelihood:
N | M1 − N | M2 − N | P(M1 = 1 | N) | P(M2 = 3 | N) | P(M1 = 1, M2 = 3 | N) |
---|---|---|---|---|---|
≤ 1 | ≥ 0 | ≥ 2 | — | 0 | 0 |
2 | −1 | 1 | e/2 | e/2 | e²/4 |
3 | −2 | 0 | 0 | 1 − e − f | 0 |
4 | −3 | −1 | f | e/2 | ef/2 |
5 | −4 | −2 | f | 0 | 0 |
≥ 6 | ≤ −5 | ≤ −3 | f | f | f² |
The most likely number of stars is 2. From the problem specification, it is stated that f is "much smaller" than e. This can be taken to mean f < e/2, in which case e²/4 > ef/2 > f², so N = 2 is the most likely count.
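With illustrative numbers satisfying f < e/2 (values assumed, not given in the problem), the ordering is easy to confirm:

```python
# Compare the three candidate likelihoods from the table above.
e, f = 0.05, 0.002              # any values with f < e/2 will do
likelihood = {2: e**2 / 4, 4: e * f / 2, 6: f**2}
assert likelihood[2] > likelihood[4] > likelihood[6]
print(max(likelihood, key=likelihood.get))   # 2
```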
Consider the following variable elimination algorithm:

    function Elimination-Ask(X, e, bn) returns a distribution over X
        factors ← [ ]
        for each var in Reverse(Vars[bn]) do
            factors ← [Make-Factor(var, e) | factors]
            if var is a hidden variable then factors ← Sum-Out(var, factors)
        return Normalize(Pointwise-Product(factors))

Apply variable elimination to the query:

P(B | j, m)
For the purposes of succinct representation, the full predicates are shortened: B for Burglary, E for Earthquake, A for Alarm, with j and m the observed values JohnCalls = true and MaryCalls = true. There are also hidden variables: E and A, which must be summed out of the query.
This problem is predicated on the following Bayesian network:
Recall, in general, for a Bayes' net:

P(x1, …, xn) = ∏i P(xi | parents(Xi))
The problem is considered in terms of the entire probability distribution, **P**, as opposed to the probability of a particular event, P; for a given query this means computing a value for every assignment of the query variable. So, for this example:

P(B | j, m) = α Σe Σa P(B, e, a, j, m)
I really don't understand this notation: I thought e meant there was an earthquake, whereas E represented the probability distribution. Why isn't the equation written in terms of E?
The syntax I'm going to use is:
The first step is to reconsider the problem in terms of the known conditional independences from the Bayesian network:

P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)
Under the laws of summation (terms that do not involve the summation variable may be moved outside the sum), this is equivalent to:

P(B | j, m) = α P(B) Σe P(e) Σa P(a | B, e) P(j | a) P(m | a)
The next steps involve working through the equation from right to left producing a set of factors. Factors are produced in several ways.
For P(m | a), m is a fixed value, and the factor for a fixed value is simply:

fM(A) = ⟨P(m | a), P(m | ¬a)⟩
Similarly for P(j | a), j is a fixed value:

fJ(A) = ⟨P(j | a), P(j | ¬a)⟩
There is another notational change here. Russell and Norvig use:
Since the Russell and Norvig form is the pairing of the fixed value j with the elements from A, I am going to use my form for the pairings of the elements of J with the elements of A. This is more intuitive, since what would such a factor mean in Russell and Norvig's representation? It is a factor of what with the elements of A? If J needs to be present in the argument list to represent its inclusion in the factor, then their notation doesn't make sense. So, in my notation, the factor is a 2×2×2 matrix, since there's a dimension for each axis. Specifically:

Where A↺B represents that A and B are the unstacked layers of a three-dimensional matrix. The above result is not useful in any way to the best of my knowledge; it simply serves to illustrate the meaning of the syntax and show the procedure for a three-dimensional expansion.
For the next term, P(A | B, E), which is conditional on B and E, the factor is:
The next element in the equation is the summation over a. Factors may be combined with a type of summation where:

Σa fA(A, B, E) × fJ(A) × fM(A) = fA(a, B, E) × fJ(a) × fM(a) + fA(¬a, B, E) × fJ(¬a) × fM(¬a)
The operation × is the pointwise product. The pointwise product of two factors yields a new factor over the union of the variables in the two factors. Consider the first element in the previous equation, the term for a. (Note that fixing a selects one of the layers of fA(B, E) and reduces the dimensionality by a level.)
The process is the same for ¬a, and that factor is:
The sum of those factors then is:
Note that the process of summing reduced the factor from a 2×2 matrix to a 2×1. In general, summing will remove a dimension from the matrix.
The next element in the equation is P(E); this is simply the distribution over E, not conditional on anything, and the factor is represented as:
This is also a deviation from the notation Russell and Norvig use for this quantity.
The next element in the equation is a summation over e which, as before, produces:
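The whole query can be checked numerically. The sketch below mirrors the factored sum α P(B) Σe P(e) Σa P(a | B, e) P(j | a) P(m | a); the CPT numbers are the usual ones for Russell and Norvig's burglary network and are an assumption here, since this text never lists them:

```python
# Assumed CPTs for the alarm network (standard textbook values).
P_b, P_e = 0.001, 0.002
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(a | b, e)
P_j = {True: 0.90, False: 0.05}                      # P(j | a)
P_m = {True: 0.70, False: 0.01}                      # P(m | a)

def score(b):
    """Unnormalized P(B = b, j, m): P(B) sum_e P(e) sum_a P(a|B,e) P(j|a) P(m|a)."""
    pb = P_b if b else 1 - P_b
    total = 0.0
    for e in (True, False):
        pe = P_e if e else 1 - P_e
        inner = sum((P_a[b, e] if a else 1 - P_a[b, e]) * P_j[a] * P_m[a]
                    for a in (True, False))
        total += pe * inner
    return pb * total

posterior = score(True) / (score(True) + score(False))
print(posterior)    # ≈ 0.284
```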
Calculate the number of computations performed and compare that with the number performed by enumeration.
By counting:
Suppose a network has the form of a chain: a sequence of Boolean variables X1,…,Xn where Parents(Xi) = {Xi−1} for i = 2,…,n.
What is the complexity of computing P(X1|Xn = true) using enumeration?
In general:

P(X1 | Xn = true) = α Σx2 ⋯ Σxn−1 P(X1, x2, …, xn−1, Xn = true)

This equation can then be factored based on the knowledge in the Bayes' diagram:

P(X1 | Xn = true) = α P(X1) Σx2 P(x2 | X1) Σx3 P(x3 | x2) ⋯ Σxn−1 P(xn−1 | xn−2) P(Xn = true | xn−1)
Since each variable is Boolean (|Xi| = 2 for all i < n), the enumeration search tree will be a full binary tree of depth n − 1.
The number of non-leaf nodes in such a tree (and hence the required computational work) is 2^(n−1) − 1, so enumeration is O(2^n).
What is the complexity using elimination?
Each reduction will take two computations and there are n of them, so the complexity is 2n, i.e., O(n).
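Both approaches can be compared directly on a small chain; a sketch with randomly generated CPTs (all names here are my own, not from the text):

```python
import itertools
import random

random.seed(0)
n = 6
prior = 0.3                                    # P(X1 = true)
cpt = {i: (random.random(), random.random()) for i in range(2, n + 1)}
# cpt[i] = (P(Xi=T | X_{i-1}=T), P(Xi=T | X_{i-1}=F))

def p_cond(i, prev, val):
    pt = cpt[i][0] if prev else cpt[i][1]
    return pt if val else 1 - pt

# Enumeration: O(2^n) -- sum the joint over all assignments with Xn = true.
num = {True: 0.0, False: 0.0}
for assign in itertools.product([True, False], repeat=n):
    if not assign[-1]:
        continue
    p = prior if assign[0] else 1 - prior
    for i in range(2, n + 1):
        p *= p_cond(i, assign[i - 2], assign[i - 1])
    num[assign[0]] += p
enum_post = num[True] / (num[True] + num[False])

# Elimination: O(n) -- pass the message m_i(x_i) = P(Xn = true | Xi = x_i)
# backward along the chain, two multiply-adds per step.
msg = {True: 1.0, False: 0.0}                  # base case at i = n
for i in range(n, 1, -1):
    msg = {prev: sum(p_cond(i, prev, v) * msg[v] for v in (True, False))
           for prev in (True, False)}
elim_post = prior * msg[True] / (prior * msg[True] + (1 - prior) * msg[False])

assert abs(enum_post - elim_post) < 1e-9
```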
You're a security guard in an underground installation. You want to know if it's raining on a given day, but your only exposure to the outside world occurs when the director arrives each morning either with or without an umbrella. Thus, for each day t, the set of evidence variables, Et, contains a single element ut representing the presence or absence of an umbrella. The set of unobservable state variables, Xt, contains a single variable rt representing whether or not it is raining.
Furthermore, assume the following observed probabilities:
rt - 1 | P(rt = T|rt - 1) |
---|---|
T | 0.7 |
F | 0.3 |
rt | P(ut = T|rt) |
---|---|
T | 0.9 |
F | 0.2 |
Suppose you observe an unending sequence of days on which the umbrella appears. Show that, as the days go by, the probability of rain on the current day increases monotonically toward a fixed point. What is that fixed point?
An examination of this question requires an application of chaining for a first-order Markov process, which states:

P(Rt+1 | u1:t+1) = α P(ut+1 | Rt+1) Σrt P(Rt+1 | rt) P(rt | u1:t)
This function allows the probability of a given event to be built up from a given set of evidence. Like any chain however, it must have an end. In this case, that end is P(R0) = <P(r0),P(¬r0)> which is the probability distribution for rain on day 0.
For the sake of simplicity, P(R0) will be assumed to be fixed at <0.5,0.5>. If, as the question asks, the distribution is converging on a fixed value, then the starting point for the convergence should not matter.
The calculations start off with a computation of the probability distribution for rain on day 1 which is slightly different from other calculations because P(R0|u0) is defined as P(R0).
Now, this value may be used for the first link in the chain:
The next link progresses in the same way, but u1 is now defined and meaningful:
The next link in the chain is:
To do one more iteration to solidify the pattern:
The effect of having an umbrella present every day is that the update factors remain the same at every step. The closed-form math has to date eluded me, but this program will estimate the converged value to an arbitrary number of decimal places. The value it comes up with is, to 50 places:
<0.89674554944846594496715149092767174611258015717335, 0.10325445055153405503284850907232825388741984282665>
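The program itself isn't reproduced above; a minimal sketch of the iteration it would perform (my reconstruction, using the transition and sensor numbers from the tables):

```python
# Repeatedly apply the filtering update for a day with an umbrella:
# predict with the transition model, then condition on u_t and normalize.
def rain_fixed_point(p=0.5, iters=200):
    for _ in range(iters):
        predict = 0.7 * p + 0.3 * (1 - p)   # one-step prediction of rain
        rain = 0.9 * predict                # weight by P(u | r)
        dry = 0.2 * (1 - predict)           # weight by P(u | not r)
        p = rain / (rain + dry)             # normalize
    return p

print(rain_fixed_point())                   # ≈ 0.8967455494484659
```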
Consider forecasting further and further into the future, given just the first two umbrella observations.
Compute the probability P(r2+k | u1, u2) for 1 ≤ k ≤ 20 and plot the results.
The value of P(R2 | u1, u2) is somewhat open to debate. In the previous question the value at t = 2 was based on an initial 50/50 probability of rain. Using that assumption:
You should see that this probability converges toward a fixed point. What is this point?
<0.5,0.5>
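A quick numeric check of the convergence (the 0.883 starting value is the approximate P(r2 | u1, u2) computed from the 50/50 prior, and is itself an assumption):

```python
# With no evidence after day 2, each step applies only the transition model:
# p_{k+1} = 0.7 p_k + 0.3 (1 - p_k) = 0.4 p_k + 0.3, whose fixed point is 0.5.
p = 0.883
for k in range(20):
    p = 0.4 * p + 0.3
print(p)        # ≈ 0.5
```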