 

Naive Bayes without Naive assumption

I'm trying to understand why the naive Bayes classifier is linearly scalable with the number of features, in comparison to the same idea without the naive assumption. I understand how the classifier works and what's so "naive" about it. I'm unclear as to why the naive assumption gives us linear scaling, whereas lifting that assumption is exponential. I'm looking for a walk-through of an example that shows the algorithm under the "naive" setting with linear complexity, and the same example without that assumption that will demonstrate the exponential complexity.

asked Oct 09 '16 by dkv


People also ask

What would remove the naive assumption of Naive Bayes?

To overcome this issue, the naive Bayes algorithm assumes that all features are independent of each other. Furthermore, the denominator p(x1, x2, …, xn) can be dropped to simplify the equation, because it only normalizes the conditional probability of a class given an observation, p(yi | x1, x2, …, xn).

What are the assumptions of Naive Bayes?

In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter.

Why is Naive Bayes called naive, given that its assumption may or may not be true?

Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data; however, the technique is very effective on a large range of complex problems.

What is the critical assumption of Naive Bayes?

The Naive Bayes classifier assumes that the presence of a particular feature is unrelated to the presence of any other feature. It combines Bayes' theorem with the assumption that the predictors are independent.
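
As the snippets above note, the independence assumption plus the dropped denominator reduce classification to comparing a product of per-feature terms across classes. Below is a minimal, hypothetical Python sketch of that decision rule (the priors and likelihoods are made-up numbers, not taken from any of the answers):

    import math

    # Hypothetical, hand-set parameters for a 2-class, 3-feature toy problem.
    # priors[y]        = P(y)
    # likelihood[y][i] = P(x_i = 1 | y); P(x_i = 0 | y) is its complement.
    priors = {"spam": 0.4, "ham": 0.6}
    likelihood = {
        "spam": [0.8, 0.7, 0.1],
        "ham":  [0.2, 0.3, 0.5],
    }

    def predict(x):
        """Return argmax_y P(y) * prod_i P(x_i | y); the denominator P(x)
        is identical for every class, so it never has to be computed."""
        scores = {}
        for y in priors:
            log_score = math.log(priors[y])
            for i, xi in enumerate(x):
                p = likelihood[y][i] if xi == 1 else 1.0 - likelihood[y][i]
                log_score += math.log(p)
            scores[y] = log_score
        return max(scores, key=scores.get)

    print(predict([1, 1, 0]))  # -> 'spam' with these made-up numbers

Working in log space is the usual trick to avoid numerical underflow once many small per-feature probabilities are multiplied together.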


2 Answers

The problem here lies in the following quantity

P(x1, x2, x3, ..., xn | y)

which you have to estimate. When you assume "naiveness" (feature independence) you get

P(x1, x2, x3, ..., xn | y) = P(x1 | y)P(x2 | y) ... P(xn | y)

and you can estimate each P(xi | y) independently. Naturally, this approach scales linearly: if you add another k features, you only need to estimate another k probabilities, each with some very simple technique (like counting the objects that have a given feature).
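
As a rough illustration of that counting approach (the tiny boolean dataset below is made up, not part of the original answer), each P(xi | y) gets its own little table of counts, so adding a feature only adds one more table:

    from collections import defaultdict

    def estimate_naive(X, y):
        """Estimate P(x_i = 1 | y) for every feature i by simple counting.
        The result has n_features * n_classes entries: linear in the features."""
        counts = defaultdict(lambda: defaultdict(int))  # counts[c][i] = #{x_i = 1 and y = c}
        class_totals = defaultdict(int)                 # class_totals[c] = #{y = c}
        for row, c in zip(X, y):
            class_totals[c] += 1
            for i, value in enumerate(row):
                if value == 1:
                    counts[c][i] += 1
        return {c: [counts[c][i] / class_totals[c] for i in range(len(X[0]))]
                for c in class_totals}

    # Tiny made-up dataset: 3 boolean features, 2 classes.
    X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
    y = ["a", "a", "b", "b"]
    print(estimate_naive(X, y))  # {'a': [1.0, 0.5, 0.5], 'b': [0.0, 0.5, 1.0]}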

Now, without naiveness you do not have any such decomposition. Thus you have to keep track of all probabilities of the form

P(x1=v1, x2=v2, ..., xn=vn | y)

for every possible combination of values v1, ..., vn. In the simplest case, each vi is just "true" or "false" (the event happened or not), and this already gives you 2^n probabilities to estimate (one for each possible assignment of "true"/"false" to a series of n boolean variables). Consequently, the complexity of the algorithm grows exponentially. However, the biggest issue here is usually not the computational one, but rather the lack of data. Since there are 2^n probabilities to estimate, you need more than 2^n data points to have any estimate at all for every possible event. In real life you will never encounter a dataset of around 1,000,000,000,000 points... and that (roughly 2^40) is the number of required (unique!) points for just 40 boolean features with such an approach.
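
For contrast, here is a sketch of the same made-up setup without the naive assumption: the probabilities are keyed by entire feature vectors, so with n boolean features there are 2^n possible keys per class, and almost all of them never appear in the data:

    from collections import defaultdict

    def estimate_joint(X, y):
        """Estimate P(x1 = v1, ..., xn = vn | y) by counting whole configurations.
        With n boolean features there are 2**n possible configurations per class."""
        counts = defaultdict(lambda: defaultdict(int))  # counts[c][config] = #{x = config and y = c}
        class_totals = defaultdict(int)
        for row, c in zip(X, y):
            class_totals[c] += 1
            counts[c][tuple(row)] += 1
        return {c: {config: k / class_totals[c] for config, k in counts[c].items()}
                for c in class_totals}

    X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
    y = ["a", "a", "b", "b"]
    print(estimate_joint(X, y))
    # Only 4 of the 2 * 2**3 = 16 possible (class, configuration) entries ever get
    # a non-zero estimate; every unseen configuration is stuck at probability 0.

With 40 boolean features the same table would need about 2^40 (roughly 10^12) entries per class, which is exactly the data problem described above.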

answered Oct 26 '22 by lejlot


Candy Selection

On the outskirts of Mumbai, there lived an old Grandma, whose quantitative outlook towards life had earned her the moniker Statistical Granny. She lived alone in a huge mansion, where she practised sound statistical analysis, shielded from the barrage of hopelessly flawed biases peddled as common sense by mass media and so-called pundits.

Every year on her birthday, her entire family would visit her and stay at the mansion. Sons, daughters, their spouses, her grandchildren. It would be a big bash every year, with a lot of fanfare. But what Grandma loved the most was meeting her grandchildren and getting to play with them. She had ten grandchildren in total, all of them around 10 years of age, and she would lovingly call them "random variables".

Every year, Grandma would present a candy to each of the kids. Grandma had a large box full of candies of ten different kinds. She would give a single candy to each one of the kids, since she didn't want to spoil their teeth. But, as she loved the kids so much, she took great efforts to decide which candy to present to which kid, such that it would maximize their total happiness (the maximum likelihood estimate, as she would call it).

But that was not an easy task for Grandma. She knew that each type of candy had a certain probability of making a kid happy. That probability was different for different candy types, and for different kids. Rakesh liked the red candy more than the green one, while Sheila liked the orange one above all else.

Each of the 10 kids had different preferences for each of the 10 candies.

Moreover, their preferences largely depended on external factors that were unknown to Grandma (hidden variables).

If Sameer had seen a blue building on the way to the mansion, he'd want a blue candy, while Sandeep always wanted the candy that matched the colour of his shirt that day. But the biggest challenge was that their happiness depended on what candies the other kids got! If Rohan got a red candy, then Niyati would want a red candy as well, and anything else would make her go crying into her mother's arms (conditional dependency). Sakshi always wanted what the majority of kids got (positive correlation), while Tanmay would be happiest if nobody else got the kind of candy that he received (negative correlation). Grandma had concluded long ago that her grandkids were completely mutually dependent.

It was computationally a big task for Grandma to get the candy selection right. There were too many conditions to consider and she could not simplify the calculation. Every year before her birthday, she would spend days figuring out the optimal assignment of candies, by enumerating all configurations of candies for all the kids together (which was an exponentially expensive task). She was getting old, and the task was getting harder and harder. She used to feel that she would die before figuring out the optimal selection of candies that would make her kids the happiest all at once.

But an interesting thing happened. As the years passed and the kids grew up, they finally left their teenage years behind and turned into independent adults. Their choices became less and less dependent on each other, and it became easier to figure out what each one's most preferred candy was (all of them still loved candies, and Grandma).

Grandma was quick to realise this, and she joyfully began calling them "independent random variables". It was much easier for her to figure out the optimal selection of candies - she just had to think of one kid at a time and, for each kid, assign a happiness probability to each of the 10 candy types for that kid. Then she would pick the candy with the highest happiness probability for that kid, without worrying about what she would assign to the other kids. This was a super easy task, and Grandma was finally able to get it right.

That year, the kids were finally the happiest all at once, and Grandma had a great time at her 100th birthday party. A few months following that day, Grandma passed away, with a smile on her face and a copy of Sheldon Ross clutched in her hand.

Takeaway: In statistical modelling, having mutually dependent random variables makes it really hard to find the optimal assignment of values, the one that maximises the joint probability of the whole set.

You need to enumerate all possible configurations (and their number grows exponentially with the number of variables). However, if the variables are independent, it is easy to pick out the individual assignment that maximises the probability of each variable, and then combine those individual assignments into a configuration for the entire set, as the sketch below illustrates.
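
A tiny sketch of that takeaway with made-up numbers (two kids, three candies): when the variables are independent, picking each variable's best value separately already yields the jointly best configuration, so the exponential enumeration is unnecessary:

    from itertools import product
    from math import prod

    # Hypothetical happiness probabilities: happiness[kid][candy].
    happiness = {
        "Rakesh": {"red": 0.9, "green": 0.3, "orange": 0.4},
        "Sheila": {"red": 0.2, "green": 0.5, "orange": 0.8},
    }
    kids = list(happiness)

    # Exponential approach: score every joint assignment of candies to kids
    # (3**2 = 9 here, candies**kids in general).
    joint_best = max(
        product(*(happiness[kid] for kid in kids)),
        key=lambda assignment: prod(happiness[kid][candy]
                                    for kid, candy in zip(kids, assignment)),
    )

    # Independent approach: pick each kid's best candy separately (linear in kids).
    independent_best = tuple(max(happiness[kid], key=happiness[kid].get) for kid in kids)

    print(joint_best)        # ('red', 'orange')
    print(independent_best)  # ('red', 'orange') -- identical, with no enumeration

If the kids' happiness depended on each other's candies, the score would no longer factor into per-kid terms, and only the exhaustive enumeration would be guaranteed to find the best assignment.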

In Naive Bayes, you make the assumption that the variables are independent (even if they are actually not). This simplifies your calculation, and it turns out that in many cases, it actually gives estimates that are comparable to those which you would have obtained from a more (computationally) expensive model that takes into account the conditional dependencies between variables.

I have not included any math in this answer, but hopefully this made it easier to grasp the concept behind Naive Bayes, and to approach the math with confidence. (The Wikipedia page is a good start: Naive Bayes).

Why is it "naive"?

The Gaussian Naive Bayes classifier assumes that X | Y is normally distributed with zero covariance between any of the components of X; more generally, Naive Bayes assumes that the features are conditionally independent given the class. Since this is a completely implausible assumption for almost any real problem, we refer to it as naive.
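
To make that concrete, here is a hedged sketch of the Gaussian variant under that assumption (the data is made up): each class keeps only a per-feature mean and variance, i.e. a diagonal covariance matrix, rather than a full covariance over all features:

    import numpy as np

    # Made-up data: rows are samples, columns are features; the two classes are
    # balanced, so the class priors are equal and omitted from the comparison.
    X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [2.8, 0.7]])
    y = np.array([0, 0, 1, 1])

    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Only a per-feature mean and variance per class (diagonal covariance),
        # never the full covariance matrix between features.
        params[c] = (Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)

    def log_likelihood(x, c):
        mean, var = params[c]
        return float(np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)))

    x_new = np.array([1.1, 1.9])
    print(max(params, key=lambda c: log_likelihood(x_new, c)))  # -> 0 for this toy data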

Naive Bayes will make the following assumption:

If you like pickles, and you like ice cream, Naive Bayes will assume independence, give you a pickle ice cream, and think that you'll like it.

Which may not be true at all.

For a mathematical example see: https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/

answered Oct 26 '22 by Satyam