## Naive Bayes classifier: probabilistic model

Abstractly, the probability model for a classifier is a conditional model

$$p(C \mid x_1, \ldots, x_n)$$

over a dependent class variable $C$ with a small number of outcomes or classes, conditional on several feature variables $x_1$ through $x_n$. The problem is that if the number of features $n$ is large or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.

Using Bayes’ theorem, this can be written

$$p(C \mid x_1, \ldots, x_n) = \frac{p(C)\, p(x_1, \ldots, x_n \mid C)}{p(x_1, \ldots, x_n)}.$$

In plain English, the above equation can be written as

$$\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}.$$
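As a concrete illustration, the ratio can be evaluated with hypothetical numbers; the prior, likelihood, and evidence values below are made up for the sketch:

```python
# Hypothetical values for one class C and one observed feature vector x.
prior = 0.25       # p(C): class prior
likelihood = 0.5   # p(x | C): likelihood of the features under class C
evidence = 0.25    # p(x): evidence, constant across classes for fixed x

# posterior = prior * likelihood / evidence, i.e. p(C | x)
posterior = prior * likelihood / evidence
print(posterior)  # 0.5
```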

In practice, there is interest only in the numerator of that fraction, because the denominator does not depend on $C$ and the values of the features are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

$$p(C, x_1, \ldots, x_n),$$

which can be rewritten as follows, using the chain rule for repeated applications of the definition of conditional probability:

$$p(C, x_1, \ldots, x_n) = p(x_1 \mid x_2, \ldots, x_n, C)\, p(x_2 \mid x_3, \ldots, x_n, C) \cdots p(x_{n-1} \mid x_n, C)\, p(x_n \mid C)\, p(C). \tag{1}$$
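The chain rule holds for any joint distribution, which can be checked numerically on a small made-up table; the probabilities below are arbitrary values that sum to 1, chosen only for the sketch:

```python
from itertools import product

# Hypothetical joint distribution p(C, x1, x2) over three binary variables,
# stored as a lookup table; the values are made up and sum to 1.
p = {key: v for key, v in zip(
    product((0, 1), repeat=3),
    [0.10, 0.05, 0.15, 0.20, 0.08, 0.12, 0.10, 0.20])}

def marginal(**fixed):
    """Sum p over all entries consistent with the fixed variables."""
    names = ("c", "x1", "x2")
    return sum(v for key, v in p.items()
               if all(key[names.index(n)] == val for n, val in fixed.items()))

# Chain rule: p(C, x1, x2) = p(x1 | x2, C) * p(x2 | C) * p(C)
c, x1, x2 = 1, 0, 1
lhs = p[(c, x1, x2)]
rhs = (p[(c, x1, x2)] / marginal(c=c, x2=x2)) \
    * (marginal(c=c, x2=x2) / marginal(c=c)) \
    * marginal(c=c)
assert abs(lhs - rhs) < 1e-12
```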

Now the “naive” conditional independence assumptions come into play: assume that each feature $x_i$ is conditionally independent of every other feature $x_j$ (for $j \neq i$), given the category $C$. This means that

$$p(x_1 \mid x_2, \ldots, x_n, C) = p(x_1 \mid C),$$

$$p(x_2 \mid x_3, \ldots, x_n, C) = p(x_2 \mid C),$$

and so on for each $x_i$, and so the joint model can be expressed as

$$p(C, x_1, \ldots, x_n) = p(C)\, p(x_1 \mid C)\, p(x_2 \mid C) \cdots p(x_n \mid C) = p(C) \prod_{i=1}^{n} p(x_i \mid C). \tag{2}$$

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ is

$$p(C \mid x_1, \ldots, x_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(x_i \mid C),$$

where $Z = p(x_1, \ldots, x_n)$ (the evidence) is a scaling factor dependent only on $x_1, \ldots, x_n$, that is, a constant if the values of the feature variables are known.
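This factored posterior is straightforward to compute directly. The following is a minimal sketch for two classes and three binary (Bernoulli) features; the priors and conditional probabilities are hypothetical numbers chosen for illustration:

```python
from math import prod

# Hypothetical model: 2 classes, 3 binary features.
# priors[k] = p(C = k); cond[k][i] = p(x_i = 1 | C = k) (Bernoulli parameters).
priors = [0.6, 0.4]
cond = [[0.8, 0.1, 0.5],
        [0.3, 0.7, 0.5]]

def numerator(k, x):
    """p(C = k) * prod_i p(x_i | C = k) for a binary feature vector x."""
    return priors[k] * prod(p if xi else 1 - p for p, xi in zip(cond[k], x))

x = [1, 0, 1]
numerators = [numerator(k, x) for k in range(len(priors))]
Z = sum(numerators)                      # the evidence p(x), summed over classes
posterior = [n / Z for n in numerators]  # p(C = k | x); sums to 1
print(posterior)
```

Note that $Z$ is obtained by summing the numerator over all classes, which is why the denominator never needs to be modeled separately.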

Models of this form are much more manageable, because they factor into a so-called class prior $p(C)$ and independent probability distributions $p(x_i \mid C)$. If there are $k$ classes and if a model for each $p(x_i \mid C = c)$ can be expressed in terms of $r$ parameters, then the corresponding naive Bayes model has $(k - 1) + n r k$ parameters. In practice, binary classification ($k = 2$) with Bernoulli feature variables ($r = 1$) is common, and so the total number of parameters of the naive Bayes model is $2n + 1$, where $n$ is the number of binary features used for classification.
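The parameter count is easy to check numerically; the helper function below is only an illustration of the formula:

```python
def num_params(k, n, r):
    """Parameter count of a naive Bayes model: (k - 1) free class-prior
    parameters plus r parameters per feature per class, i.e. (k - 1) + n*r*k."""
    return (k - 1) + n * r * k

# Binary classification with n Bernoulli features: (2 - 1) + n * 1 * 2 = 2n + 1.
for n in (1, 5, 10):
    assert num_params(2, n, 1) == 2 * n + 1
print(num_params(2, 10, 1))  # 21
```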