Nanos gigantium humeris insidentes!

Naive Bayes classifier Probabilistic model

  • October 14, 2013 8:08 pm

Abstractly, the probability model for a classifier is a conditional model.
p(C \vert F_1,\dots,F_n)\,
over a dependent class variable C with a small number of outcomes or classes, conditional on several feature variables F_1 through F_n. The problem is that if the number of features n is large or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.

Using Bayes’ theorem, this can be written
p(C \vert F_1,\dots,F_n) = \frac{p(C) \ p(F_1,\dots,F_n\vert C)}{p(F_1,\dots,F_n)}. \,
In plain English the above equation can be written as
\mbox{posterior} = \frac{\mbox{prior} \times \mbox{likelihood}}{\mbox{evidence}}. \,
In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on C and the values of the features F_i are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model
p(C, F_1, \dots, F_n)\,
which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

(1)   \begin{align*} p(C, F_1, \dots, F_n)\, &∝ p(C)\ p(F_1,\dots,F_n\vert C) \\ &∝ p(C) \ p(F_1\vert C) \ p(F_2,\dots,F_n\vert C, F_1) \\ &∝ p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ p(F_3,\dots,F_n\vert C, F_1, F_2) \\ &∝ p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ p(F_3\vert C, F_1, F_2) \ p(F_4,\dots,F_n\vert C, F_1, F_2, F_3) \\ &∝ p(C) \ p(F_1\vert C) \ p(F_2\vert C, F_1) \ \dots p(F_n\vert C, F_1, F_2, F_3,\dots,F_{n-1}). \end{align*}

Now the “naive” conditional independence assumptions come into play: assume that each feature F_i is conditionally independent of every other feature F_j for j\neq i given the category C. This means that
p(F_i \vert C, F_j) = p(F_i \vert C);  p(F_i \vert C, F_j,F_k) = p(F_i \vert C); p(F_i \vert C, F_j,F_k,F_l) = p(F_i \vert C)\,, and so on,
for i\ne j,k,l, and so the joint model can be expressed as

(2)    \begin{align*} p(C \vert F_1, \dots, F_n) & ∝ p(C, F_1, \dots, F_n) \\ & ∝ p(C) \ p(F_1\vert C) \ p(F_2\vert C) \ p(F_3\vert C) \ \cdots \\ & ∝ p(C) \prod_{i=1}^n p(F_i \vert C)\,. \end{align*}

This means that under the above independence assumptions, the conditional distribution over the class variable C is:
p(C \vert F_1,\dots,F_n) = \frac{1}{Z} p(C) \prod_{i=1}^n p(F_i \vert C)
where Z (the evidence) is a scaling factor dependent only on F_1,\dots,F_n, that is, a constant if the values of the feature variables are known.
Models of this form are much more manageable, because they factor into a so-called class prior p(C) and independent probability distributions p(F_i\vert C). If there are k classes and if a model for each p(F_i\vert C=c) can be expressed in terms of r parameters, then the corresponding naive Bayes model has (k − 1) + n r k parameters. In practice, often k=2 (binary classification) and r=1 (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is 2n+1, where n is the number of binary features used for classification.


A short summary from Kevin Murphy.

Print Friendly