
Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier, but my problem is that I have a mix of categorical data (e.g. "Registered online", "Accepts email notifications", etc.) and continuous data (e.g. "Age", "Length of membership", etc.). I haven't used scikit-learn much before, but I suppose that Gaussian Naive Bayes is suitable for continuous data and that Bernoulli Naive Bayes can be used for categorical data. However, since I want to have both categorical and continuous data in my model, I don't really know how to handle this. Any ideas would be much appreciated!

asked Jan 10 '13 by user1499144


4 Answers

You have at least two options:

  • Transform all your data into a categorical representation by computing percentiles for each continuous variable and then binning the continuous variables using the percentiles as bin boundaries. For instance, for the height of a person, create the following bins: "very small", "small", "regular", "big", "very big", ensuring that each bin contains approximately 20% of the population of your training set. We don't have any utility to perform this automatically in scikit-learn, but it should not be too complicated to do it yourself. Then fit a single multinomial NB on that categorical representation of your data (see the first sketch after this list).

  • Independently fit a Gaussian NB model on the continuous part of the data and a multinomial NB model on the categorical part. Then transform the whole dataset by taking the class assignment probabilities (with the predict_proba method) as new features: np.hstack((multinomial_probas, gaussian_probas)), and then refit a new model (e.g. a new Gaussian NB) on the new features (see the second sketch after this list).
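A minimal sketch of the first option, under the assumption that the continuous columns sit in an array X_cont and the categorical ones in X_cat (all names and data here are illustrative, not from the original answer). It uses KBinsDiscretizer, which was added to scikit-learn after this answer was written, to do the percentile binning:

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

rng = np.random.RandomState(0)
X_cont = rng.normal(loc=40, scale=10, size=(100, 2))  # e.g. age, length of membership
X_cat = rng.randint(0, 2, size=(100, 2))              # e.g. registered online, accepts emails
y = rng.randint(0, 2, size=100)                       # gender label

# 5 quantile bins per continuous feature, each holding ~20% of the training set
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
X_cont_binned = binner.fit_transform(X_cont)

# one-hot encode the columns that are already categorical
X_cat_onehot = OneHotEncoder().fit_transform(X_cat).toarray()

# a single multinomial NB on the fully categorical representation
clf = MultinomialNB().fit(np.hstack([X_cont_binned, X_cat_onehot]), y)

And a sketch of the second option, reusing X_cont, X_cat_onehot and y from above:

from sklearn.naive_bayes import GaussianNB, MultinomialNB

# fit the two NB models independently on their own feature subsets
gaussian_nb = GaussianNB().fit(X_cont, y)
multinomial_nb = MultinomialNB().fit(X_cat_onehot, y)

# use the class assignment probabilities of both models as new features
gaussian_probas = gaussian_nb.predict_proba(X_cont)
multinomial_probas = multinomial_nb.predict_proba(X_cat_onehot)
X_new = np.hstack((multinomial_probas, gaussian_probas))

# refit a new model (here another Gaussian NB) on the stacked probabilities
final_clf = GaussianNB().fit(X_new, y)

Note that, to reduce overfitting, the predict_proba features would ideally be produced on held-out data (e.g. via cross-validation) rather than on the same samples the two NB models were fit on.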

answered by ogrisel


Hope I'm not too late. I recently wrote a library called Mixed Naive Bayes, written in NumPy. It can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features.

https://github.com/remykarem/mixed-naive-bayes

The library is written such that the APIs are similar to scikit-learn's.

In the example below, let's assume that the first 2 features are from a categorical distribution and the last 2 are Gaussian. When constructing the classifier, specify categorical_features=[0,1], indicating that columns 0 and 1 are to follow a categorical distribution.

from mixed_naive_bayes import MixedNB

# columns 0 and 1 are categorical, columns 2 and 3 are continuous (Gaussian)
X = [[0, 0, 180.9, 75.0],
     [1, 1, 165.2, 61.5],
     [2, 1, 166.3, 60.3],
     [1, 1, 173.0, 68.2],
     [0, 2, 178.4, 71.0]]
y = [0, 0, 1, 1, 0]

clf = MixedNB(categorical_features=[0, 1])
clf.fit(X, y)
clf.predict(X)

Pip-installable via pip install mixed-naive-bayes. More information on usage is in the README.md file. Pull requests are greatly appreciated :)

answered by remykarem


The simple answer: multiply the results! It comes out the same.

Naive Bayes is based on applying Bayes' theorem with the "naive" assumption of independence between every pair of features, meaning you calculate the probability given each feature on its own, without conditioning on the others. The algorithm therefore multiplies the probability obtained from one feature with the probability obtained from the next (and the denominator can be ignored entirely, since it is just a normalizer).

So the right answer is (a sketch follows the steps):

  1. calculate the probability from the categorical variables.
  2. calculate the probability from the continuous variables.
  3. multiply 1. and 2.
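
A rough sketch of this recipe with scikit-learn pieces (the variable names and the use of CategoricalNB are my own assumptions, not part of the answer). Note that predict_proba already includes the class prior, so one copy of the prior is divided back out before multiplying:

import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

# X_cont: continuous columns, X_cat: integer-coded categorical columns, y: labels
gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_cat, y)

# empirical class priors (what both models use by default)
priors = np.bincount(y) / len(y)

# posterior_gaussian * posterior_categorical / prior is proportional to
# prior * likelihood_gaussian * likelihood_categorical
combined = gnb.predict_proba(X_cont) * cnb.predict_proba(X_cat) / priors
combined /= combined.sum(axis=1, keepdims=True)  # normalize so each row sums to 1
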
answered by Yaron


@Yaron's approach needs an extra step (step 4 below; a small numeric illustration follows the list):

  1. Calculate the probability from the categorical variables.
  2. Calculate the probability from the continuous variables.
  3. Multiply 1. and 2. AND
  4. Divide 3. by the sum of the product of 1. and 2. EDIT: What I actually mean is that the denominator should be (probability of the evidence given the hypothesis is yes) + (probability of the evidence given the hypothesis is no) (assuming a binary problem, without loss of generality). Thus, the probabilities of the hypotheses (yes or no) given the evidence sum to 1.
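
A made-up numeric illustration of steps 3 and 4 (all numbers are invented, chosen only to show the normalization):

# unnormalized scores from step 3: prior * P(categorical | class) * P(continuous | class)
score_yes = 0.5 * 0.12 * 0.30   # = 0.018
score_no = 0.5 * 0.04 * 0.10    # = 0.002

# step 4: divide by the sum so the two posteriors add up to 1
p_yes = score_yes / (score_yes + score_no)   # 0.9
p_no = score_no / (score_yes + score_no)     # 0.1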

Step 4. is the normalization step. Take a look at @remykarem's mixed-naive-bayes as an example (lines 268-278):

        if self.gaussian_features.size != 0 and self.categorical_features.size != 0:
            finals = t * p * self.priors
        elif self.gaussian_features.size != 0:
            finals = t * self.priors
        elif self.categorical_features.size != 0:
            finals = p * self.priors

        normalised = finals.T/(np.sum(finals, axis=1) + 1e-6)
        normalised = np.moveaxis(normalised, [0, 1], [1, 0])

        return normalised

The probabilities of the Gaussian and categorical models (t and p respectively) are multiplied together in line 269 (line 2 in the extract above) and then normalized as in step 4 in line 275 (fourth line from the bottom in the extract above).

answered by andreassot10