MultinomialNB error: "Unknown Label Type"

I have two numpy arrays, X_train and Y_train. The first has shape (700, 1000) and contains the values 0, 1, 2, 3, 4, and 10. The second has shape (700,) and contains the strings 'fresh' or 'rotten', since I'm working with data from the Rotten Tomatoes API. For some reason, when I execute:

nb = MultinomialNB()
nb.fit(X_train, Y_train)

I get:

ValueError: Unknown label type

I tried building a smaller pair of arrays:

print xs, '\n', ys

gives

[[0 0 0 0 1]
 [1 0 0 2 5]
 [3 2 5 5 0]
 [3 2 0 0 1]
 [1 5 1 0 0]]

['rotten' 'fresh' 'fresh' 'rotten' 'fresh']

and the multinomial NB fit gives no Unknown Label error. Any ideas on why this is happening?

I also checked the unique values in X_train, Y_train with numpy.unique and it doesn't seem like there are any weird or mistyped labels -- it's all 'fresh' or 'rotten'.

My code for generating X_train and Y_train:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def make_xy(critics, vectorizer=None):
    stext = critics['quote'].tolist() # need to have a list
    if vectorizer is None:
        vectorizer = CountVectorizer(min_df=0)
    vectorizer.fit(stext)
    X = vectorizer.transform(stext).toarray() # this is X
    Y = np.asarray(critics['fresh'])
    return X[0:1000,0:1000], Y[0:1000] # this is X_train, Y_train

where 'critics' is a pandas dataframe imported from a CSV file (https://www.dropbox.com/s/0lu5oujfm483wtr/critics.csv), and cleaned of any missing data:

critics = pd.read_csv('critics.csv')
critics = critics[~critics.quote.isnull()]
critics = critics[critics.fresh != 'none']
critics = critics[critics.quote.str.len() > 0]

covariance asked Dec 21 '13 19:12



1 Answer

The problem seems to be the dtype of Y. It looks like numpy did not manage to figure out that the labels were strings, so the array was given the generic object dtype. If you change:
Y = np.asarray(critics['fresh']) to Y = np.asarray(critics['fresh'], dtype="|S6") I think it should work.
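
A minimal sketch of how the dtype could be checked and converted, using only numpy. The small array here is a hypothetical stand-in for np.asarray(critics['fresh']), built with dtype=object on purpose to mimic what a pandas string column typically produces. Note that on Python 3 the "|S6" dtype yields byte strings, so astype(str) is shown as an assumed alternative that gives native text labels:

```python
import numpy as np

# Hypothetical stand-in for np.asarray(critics['fresh']); dtype=object is
# forced here to imitate an array converted from a pandas string column.
Y = np.array(['rotten', 'fresh', 'fresh', 'rotten', 'fresh'], dtype=object)
print(Y.dtype)  # generic object dtype, not a string dtype

# Option 1 (the suggestion above): fixed-width byte strings of length 6.
Y_bytes = Y.astype("|S6")

# Option 2 (assumption, for Python 3): native unicode strings.
Y_str = Y.astype(str)

# Inspect the resulting dtype kinds and the distinct labels.
print(Y_bytes.dtype.kind, Y_str.dtype.kind)
print(np.unique(Y_str))
```

Either converted array can then be passed to nb.fit in place of the object array. Newer scikit-learn releases may accept object arrays of strings directly, in which case the conversion is harmless but unnecessary.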

M4rtini answered Oct 14 '22 18:10