I have two numpy arrays, X_train and Y_train, where the first of dimensions (700,1000) is populated by the values 0, 1, 2, 3, 4, and 10. The second of dimensions (700,) is populated by the values 'fresh' or 'rotten', since I'm working with Rotten Tomatoes's API. For some reason, when I execute:
nb = MultinomialNB()
nb.fit(X_train, Y_train)
I get:
ValueError: Unknown label type
I tried building a smaller pair of arrays:
print xs, '\n', ys
gives
[[0 0 0 0 1]
[1 0 0 2 5]
[3 2 5 5 0]
[3 2 0 0 1]
[1 5 1 0 0]]
['rotten' 'fresh' 'fresh' 'rotten' 'fresh']
and the multinomial NB fit gives no Unknown Label error. Any ideas on why this is happening?
I also checked the unique values in X_train, Y_train with numpy.unique and it doesn't seem like there are any weird or mistyped labels -- it's all 'fresh' or 'rotten'.
My code for generating X_train and Y_train:
def make_xy(critics, vectorizer=None):
    stext = critics['quote'].tolist()  # need to have a list
    if vectorizer is None:
        vectorizer = CountVectorizer(min_df=0)
    vectorizer.fit(stext)
    X = vectorizer.transform(stext).toarray()  # this is X
    Y = np.asarray(critics['fresh'])
    return X[0:1000, 0:1000], Y[0:1000]  # this is X_train, Y_train
where 'critics' is a pandas dataframe imported from a CSV file (https://www.dropbox.com/s/0lu5oujfm483wtr/critics.csv), and cleaned of any missing data:
critics = pd.read_csv('critics.csv')
critics = critics[~critics.quote.isnull()]
critics = critics[critics.fresh != 'none']
critics = critics[critics.quote.str.len() > 0]
This article tackles the causes of, and solutions to, the ValueError: Unknown label type: 'continuous' error in Python. scikit-learn raises this error when we try to train one of its classifiers on a continuous target variable.
If a scikit-learn classification algorithm, e.g. Logistic Regression, is trained on a continuous target variable, it raises ValueError: Unknown label type: 'continuous'. The cause is that float values are passed as the target y to a classifier that accepts only categorical or discrete class labels.
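As a rough illustration of the 'continuous' case, the sketch below turns float targets into discrete class labels before they would be handed to a classifier. The scores and the 0.5 threshold are made up for the example and are not taken from the critics data.

```python
import numpy as np

# Hypothetical continuous scores; a classifier would reject these as labels.
y_continuous = np.array([0.15, 0.80, 0.55, 0.30, 0.95])

# Assumed rule for illustration only: scores >= 0.5 count as 'fresh'.
y_discrete = np.where(y_continuous >= 0.5, 'fresh', 'rotten')

print(y_discrete)
```

If the target really is a continuous quantity, a regressor is usually the better fit than binning the values.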
The Unknown label type: 'unknown' error relates to the Y values that you pass to scikit-learn. There is a mismatch between what you can pass and what you are actually passing, for example an array vs. a DataFrame, or a 1D list vs. a 2D list.
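One shape mismatch of the kind described is a label array stored as a 2D column instead of a 1D vector. The sketch below, with made-up labels, flattens it with np.ravel. Whether a column-shaped y triggers this exact error or only a warning depends on the estimator and the scikit-learn version, so treat it as a sanity check rather than a guaranteed fix.

```python
import numpy as np

# Hypothetical labels stored as a column vector of shape (3, 1).
y_column = np.array([['rotten'], ['fresh'], ['fresh']])

# Flatten to the 1D shape (3,) that classifiers expect for y.
y_flat = np.ravel(y_column)

print(y_column.shape, y_flat.shape)
```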
The problem seems to be the dtype of y. It looks like numpy didn't manage to figure out it was a string, so it was set to a generic object. If you change:
Y = np.asarray(critics['fresh'])
to
Y = np.asarray(critics['fresh'], dtype="|S6")
I think it should work.
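To see the dtype difference the answer describes, here is a minimal sketch using a small hand-made object array as a stand-in for np.asarray(critics['fresh']). It casts with astype(str) rather than "|S6", on the assumption that you are on Python 3, where "|S6" would produce byte strings instead of text.

```python
import numpy as np

# Hypothetical stand-in for np.asarray(critics['fresh']): labels held as generic objects.
y_obj = np.array(['rotten', 'fresh', 'fresh', 'rotten', 'fresh'], dtype=object)

# Cast to a fixed-width Unicode string dtype.
y_str = y_obj.astype(str)

print(y_obj.dtype)       # generic object dtype
print(y_str.dtype.kind)  # 'U' marks a Unicode string dtype
print(np.unique(y_str))
```

Printing Y_train.dtype before calling nb.fit is a quick way to check which of the two cases you are in.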