I want to learn a Naive Bayes model for a problem where the class is boolean. Some of the features are boolean, but other features are categorical and can take on a small number of values (~5).
If all my features were boolean then I would want to use sklearn.naive_bayes.BernoulliNB. It seems clear that sklearn.naive_bayes.MultinomialNB is not what I want.
One solution is to split up my categorical features into boolean features. For instance, if a variable "X" takes on values "red", "green", "blue", I can have three variables: "X is red", "X is green", "X is blue". That violates the assumption of conditional independence of the variables given the class, so it seems totally inappropriate.
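The dummy-variable expansion described above can be sketched in plain Python (the category names are the ones from the example; the helper function is illustrative):

```python
# Sketch: expanding one categorical feature into dummy (one-hot) indicators.
CATEGORIES = ["red", "green", "blue"]

def one_hot(value, categories=CATEGORIES):
    """Return a 0/1 indicator vector with a single 1 at the value's index."""
    return [1 if value == c else 0 for c in categories]

rows = ["red", "blue", "green"]
encoded = [one_hot(v) for v in rows]
# Each encoded row has exactly one 1, so the indicators are perfectly
# dependent: knowing any two of them determines the third.
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Note that each row sums to 1 by construction, which is exactly why treating the indicators as independent Bernoulli features is questionable.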
Another possibility is to encode the variable as real-valued, where 0.0 means red, 1.0 means green, and 2.0 means blue, and then use GaussianNB. That also seems totally inappropriate: the numeric ordering is arbitrary, and the values are in no sense Gaussian.
I don't understand how to fit what I am trying to do into the Naive Bayes models that sklearn gives me.
[Edit to explain why I don't think multinomial NB is what I want]:
My understanding is that in multinomial NB the feature vector consists of counts of how many times a token was observed in k iid samples.
My understanding is that this is a fit for document classification, where there is an underlying class of document, and each word in the document is assumed to be drawn from a categorical distribution specific to that class. A document would have k tokens, the feature vector would be of length equal to the vocabulary size, and the sum of the feature counts would be k.
In my case, I have a number of Bernoulli variables plus a couple of categorical ones, but there is no concept of "counts" here.
Example: classes are people who like or don't like math. Predictors are college major (categorical) and whether they went to graduate school (boolean).
I don't think this fits multinomial since there are no counts here.
The categorical Naive Bayes classifier is suitable for classification with discrete features that are categorically distributed. The categories of each feature are drawn from a categorical distribution.
The Multinomial Naive Bayes algorithm is a Bayesian learning approach popular in natural language processing (NLP). Using Bayes' theorem, it predicts the tag of a text, such as an email or a newspaper story: it calculates each tag's probability for a given sample and outputs the tag with the highest probability.
You write: "Some of the features are boolean, but other features are categorical and can take on a small number of values (~5)."
This is an interesting question, but it is actually more than a single one: how to model a single categorical feature, how to combine several heterogeneous features, and how to do this in sklearn.
Consider first a single categorical feature. NB assumes/simplifies that the features are independent. Your idea of transforming this into several binary variables is exactly that of dummy variables. Clearly, these dummy variables are anything but independent. Your idea of then running a Bernoulli NB on the result implicitly assumes independence. While it is known that, in practice, NB does not necessarily break when faced with dependent variables, there is no reason to transform the problem into the worst configuration for NB, especially as multinomial NB is a very easy alternative.
Conversely, suppose that after transforming the single categorical variable into a multi-column dataset using the dummy variables, you use a multinomial NB. The theory for multinomial NB states:
With a multinomial event model, samples (feature vectors) represent the frequencies with which certain events have been generated by a multinomial ... where p_i is the probability that event i occurs. A feature vector ... is then a histogram, with x_i counting the number of times event i was observed in a particular instance. This is the event model typically used for document classification, with events representing the occurrence of a word in a single document (see bag of words assumption).
So, here, each instance of your single categorical variable is a "length-1 paragraph", and the distribution is exactly multinomial. Specifically, each row has 1 in one position and 0 in all the rest because a length-1 paragraph must have exactly one word, and so those will be the frequencies.
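This "length-1 paragraph" claim is easy to check in plain Python: with n = 1 trial, the multinomial PMF of a one-hot count vector reduces to the probability of the single observed category. (The probability values below are made-up illustrative numbers.)

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(counts | probs) for a multinomial with n = sum(counts) trials."""
    n = sum(counts)
    coef = factorial(n)
    for c in counts:
        coef //= factorial(c)  # multinomial coefficient n! / (c_1! ... c_k!)
    return coef * prod(p ** c for p, c in zip(probs, counts))

probs = [0.5, 0.3, 0.2]   # hypothetical per-class category probabilities
one_hot_row = [0, 1, 0]   # a "length-1 document": counts summing to 1
# With one trial, the coefficient is 1 and the PMF is just probs[1]:
assert multinomial_pmf(one_hot_row, probs) == probs[1]
```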
Note that, from the point of view of sklearn's multinomial NB, the fact that the dataset now has 5 columns does not imply an assumption of independence.
Now consider the case where you have a dataset consisting of several features. Under the very assumption of using NB, these variables are conditionally independent given the class. Consequently, you can fit a separate classifier per feature (sklearn's Bernoulli NB is itself simply a shortcut for several single-feature Bernoulli NBs): by the definition of independence, the probability for an instance is the product of the probabilities these per-feature classifiers assign.
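The product-of-per-feature-probabilities idea can be sketched on the question's toy example (all probability values below are made-up illustrative numbers):

```python
# Hand-rolled NB on the toy example: class = likes math or not,
# features = college major (categorical) and grad school (boolean).
priors = {"likes_math": 0.4, "dislikes_math": 0.6}
p_major = {  # P(major | class): one categorical distribution per class
    "likes_math":    {"math": 0.5, "cs": 0.3, "history": 0.2},
    "dislikes_math": {"math": 0.1, "cs": 0.3, "history": 0.6},
}
p_grad = {  # P(grad_school = True | class): one Bernoulli per class
    "likes_math": 0.7, "dislikes_math": 0.3,
}

def posterior(major, grad):
    """P(class | features) via P(class) * product of per-feature likelihoods."""
    joint = {}
    for c in priors:
        p_g = p_grad[c] if grad else 1 - p_grad[c]
        joint[c] = priors[c] * p_major[c][major] * p_g
    z = sum(joint.values())  # normalize over the two classes
    return {c: v / z for c, v in joint.items()}

print(posterior("math", True))
```

Each feature contributes one factor to the product, regardless of whether it is Bernoulli or categorical; this is exactly why NB handles mixed feature types so easily.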
CategoricalNB by scikit-learn is a new class added to the naive_bayes module (at the time of writing it was available only in the nightly build; it shipped in version 0.22).
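As of scikit-learn 0.22, CategoricalNB is in stable releases. A minimal sketch, using a hypothetical ordinal encoding of the question's example (majors encoded 0=math, 1=cs, 2=history; grad school as 0/1; the data is made up):

```python
from sklearn.naive_bayes import CategoricalNB

# CategoricalNB expects each feature ordinally encoded as 0..n_categories-1.
X = [[0, 1], [1, 1], [2, 0], [0, 0], [1, 0], [2, 1]]
y = [1, 1, 0, 1, 0, 0]  # 1 = likes math

clf = CategoricalNB()  # default alpha=1 Laplace smoothing
clf.fit(X, y)
print(clf.predict([[0, 1]]))  # a math major who went to grad school
```

Each feature gets its own per-class categorical distribution, so no dummy-variable expansion is needed.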
Mixed Naive Bayes (https://github.com/remykarem/mixed-naive-bayes) can assume a mix of Gaussian and categorical (multinoulli) distributions on the training data features. The library is written so that its API is similar to scikit-learn's.
from mixed_naive_bayes import MixedNB
X = [[0, 0],
     [1, 1],
     [2, 1],
     [1, 1],
     [0, 2]]
y = [0, 0, 1, 1, 0]
clf = MixedNB(categorical_features='all')
clf.fit(X, y)
clf.predict(X)
See my response in a similar question here https://stackoverflow.com/a/58428035/4570466.