Generative Classifier And Discriminative Classifier
13 Apr 2020

Recently I was learning the Naive Bayes classifier and learnt that there are two categories of classifiers for estimating the distribution P(Y|X) from given data: the Generative Classifier and the Discriminative Classifier.
At the beginning I was puzzled by the differences between these two classifiers, since their goals are the same and their equations are similar. Besides, I also found that two estimation methods, MAP (Maximum A Posteriori) and MLE (Maximum Likelihood Estimation), show up quite often when using these two classifiers. What are the relationships between them?
So, in this post I will try to sum up some of the findings from my exploration of the correct path to understanding these two classifiers and the two estimation methods.
Overview
There is a dataset D = {(x1, y1), …, (xn, yn)} drawn from a distribution P(X,Y), and our purpose is to estimate P(Y|X) so that we can predict the label y for any features x.
Both the Generative Classifier and the Discriminative Classifier are used to learn the conditional probability P(Y|X), but they learn it in different ways. A Generative Classifier first learns the joint probability P(X,Y) and then uses the Bayes Theorem to calculate the conditional probability P(Y|X), while a Discriminative Classifier learns P(Y|X) directly.
MLE and MAP are the methods used to learn the joint probability or the conditional probability mentioned for these two classifiers.
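As a quick preview of the two estimation methods, here is a worked coin-flip example (an illustration of my own, not from the original post). Suppose D consists of 3 heads in 4 flips and θ is the probability of heads. MLE maximizes P(D|θ) alone, while MAP also weighs a prior P(θ), here a Beta(2,2):

$$\theta_{MLE} = \arg\max_\theta P(D\mid\theta) = \frac{3}{4}, \qquad \theta_{MAP} = \arg\max_\theta P(D\mid\theta)\,P(\theta) = \frac{3+1}{4+2} = \frac{2}{3}$$

The prior pulls the MAP estimate toward 1/2; with more data, the two estimates converge.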
Generative Classifier
In a Generative Classifier, the joint probability is factored as P(X,Y) = P(Y)P(X|Y), and the conditional probability is then obtained from it by the Bayes rule:

$$P(Y\mid X) = \frac{P(Y)\,P(X\mid Y)}{\sum_{Y'} P(Y')\,P(X\mid Y')}$$
So we need to learn P(Y) and P(X|Y) first. The steps are the following:
- Assume some functional form for P(Y) and P(X|Y)
  - Assume P(Y) ≈ P(Y|θ) = f(Y, θ). θ represents the parameters or hypothesis we assume for the real distribution P(X,Y).
  - There is a debate between frequentist and Bayesian inference over the interpretation of θ here. The Bayesian approach believes that θ is a random variable which can have its own distribution P(θ), while the frequentist approach treats θ as an unknown fixed quantity associated with the distribution P(X,Y).
  - f is a function assumed by us.
  - Similarly for P(X|Y), assume P(X|Y) ≈ P(X|Y, θ) = g(X, Y, θ).
- Estimate parameters θ of P(Y|θ) and P(X|Y,θ) based on the dataset D.
  - We can use either MLE or MAP to learn θ (both are illustrated in the sketch after this list)
    - If we use MLE, we are trying to find θ = argmaxθ P(D|θ)
    - If we use MAP, we are trying to find θ = argmaxθ P(θ|D) = argmaxθ P(D|θ)P(θ)
    - If we use MAP, we are being Bayesian, since we believe there is a distribution P(θ) for θ.
- Use the Bayes rule to calculate P(Y|X)
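To make these steps concrete, here is a minimal sketch of a generative classifier, assuming a Bernoulli Naive Bayes form for g (the toy data and all names are my own illustration, not from the original post). P(Y) and P(X|Y) are estimated from counts; setting alpha = 1 gives the plain MLE estimates, while alpha = 2 gives the MAP estimates under a Beta(2,2) prior, i.e. Laplace smoothing:

```python
import numpy as np

# Toy dataset: 6 samples, 2 binary features, binary labels (made up for illustration).
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]])
y = np.array([1, 1, 0, 0, 1, 0])
classes = np.unique(y)

# Steps 1-2: estimate P(Y) and P(X|Y) from counts.
# MAP with a Beta(alpha, alpha) prior adds alpha - 1 pseudo-counts;
# alpha = 1 recovers MLE, alpha = 2 is Laplace smoothing.
alpha = 2.0
prior = np.array([(y == c).mean() for c in classes])  # P(Y=c)
cond = np.array([                                     # P(X_j = 1 | Y = c)
    (X[y == c].sum(axis=0) + (alpha - 1)) / ((y == c).sum() + 2 * (alpha - 1))
    for c in classes
])

def predict_proba(x):
    # Step 3: Bayes rule, P(Y=c|x) = P(Y=c) Π_j P(x_j|Y=c) / Σ_c' P(Y=c') Π_j P(x_j|Y=c')
    likelihood = np.prod(np.where(x == 1, cond, 1 - cond), axis=1)
    joint = prior * likelihood
    return joint / joint.sum()

print(predict_proba(np.array([1, 1])))  # posterior P(Y | X=[1,1]) over the classes
```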
Discriminative Classifier
In a Discriminative Classifier, the conditional probability P(Y|X) is learnt directly. The steps are the following:
- Assume some functional form for P(Y|X)
  - Assume P(Y|X) ≈ P(Y|X, θ) = f(X, Y, θ).
- Estimate parameters θ of P(Y|X,θ) based on the dataset D
  - We can use either MLE or MAP to learn here too; the details are the same as in the Generative Classifier (see the sketch below).
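As before, here is a minimal sketch of a discriminative classifier, assuming logistic regression as the functional form f (my own illustration, not from the original post). It learns P(Y|X, θ) directly by MLE, using gradient ascent on the log-likelihood; setting lam > 0 turns it into MAP with a Gaussian prior P(θ), which is exactly L2 regularization:

```python
import numpy as np

# Same toy data as the generative sketch above (made up for illustration).
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [1, 1], [0, 0]], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Functional form f: P(Y=1 | X, θ) = sigmoid(X·w + b), with θ = (w, b).
w, b = np.zeros(X.shape[1]), 0.0
lr, lam = 0.1, 0.0  # lam = 0 → MLE; lam > 0 → MAP with a Gaussian prior (L2 penalty)

for _ in range(1000):
    p = sigmoid(X @ w + b)               # current estimate of P(Y=1 | x_i, θ)
    w += lr * (X.T @ (y - p) - lam * w)  # ascend the (penalized) log-likelihood
    b += lr * (y - p).sum()

print(sigmoid(np.array([1.0, 1.0]) @ w + b))  # learned P(Y=1 | X=[1,1])
```

Note that switching from MLE to MAP here only changes the objective being maximized; the training loop itself is unchanged, which mirrors the relationship between the two methods described above.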