I am trying to perform classification in Python using Pandas and scikit-learn. My dataset contains a mix of text variables, numerical variables and categorical variables.
Let's say my dataset looks like this:
Project Cost    Project Category    Project Description       Project Outcome
12392.2         ABC                 This is a description     Fully Funded
493992.4        DEF                 Stack Overflow rocks      Expired
And I need to predict the variable Project Outcome. Here is what I did (assuming df contains my dataset):

I converted the categorical columns Project Category and Project Outcome to numeric values:
df['Project Category'] = df['Project Category'].factorize()[0]
df['Project Outcome'] = df['Project Outcome'].factorize()[0]
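For reference, factorize returns a pair: an array of integer codes and the index of unique labels, so the mapping back to the original categories is preserved. A minimal sketch with made-up values:

```python
import pandas as pd

# factorize() returns (codes, uniques): integer codes per row,
# plus the unique labels in order of first appearance
codes, uniques = pd.Series(['ABC', 'DEF', 'ABC']).factorize()
print(list(codes))    # [0, 1, 0]
print(list(uniques))  # ['ABC', 'DEF']
```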
Dataset now looks like this:
Project Cost    Project Category    Project Description       Project Outcome
12392.2         0                   This is a description     0
493992.4        1                   Stack Overflow rocks      1
Then I processed the text column using TF-IDF
tfidf_vectorizer = TfidfVectorizer()
df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])
Dataset now looks something like this:
Project Cost    Project Category    Project Description                       Project Outcome
12392.2         0                   (0, 249)\t0.17070240732941433\n (0, 304)\t0..    0
493992.4        1                   (0, 249)\t0.17070240732941433\n (0, 304)\t0..    1
So, since all variables were now numeric, I thought I was good to go and started training my model:
X = df.drop(columns=['Project Outcome'])
y = df['Project Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = MultinomialNB()
model.fit(X_train, y_train)
But I get the error ValueError: setting an array element with a sequence. when calling model.fit. When I print X_train, I notice that Project Description has been replaced by NaN for some reason.
Any help on this? Is there a good way to do classification using variables with various data types? Thank you.
Replace this
df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])
with
df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description']).toarray()
You can also use: tfidf_vectorizer.fit_transform(df['Project Description']).todense()
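As an alternative to storing the dense matrix inside the DataFrame, a common pattern is to keep the TF-IDF output as a separate sparse matrix and stack it with the remaining numeric columns using scipy.sparse.hstack. A minimal sketch based on the question's column names (the extra rows beyond the two shown in the question are made up for illustration, and the data is too small to split into train/test):

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data modeled on the question's example (extra rows invented)
df = pd.DataFrame({
    'Project Cost': [12392.2, 493992.4, 1000.0, 250.5],
    'Project Category': ['ABC', 'DEF', 'ABC', 'DEF'],
    'Project Description': ['This is a description',
                            'Stack Overflow rocks',
                            'another project description',
                            'rocks and more rocks'],
    'Project Outcome': ['Fully Funded', 'Expired', 'Fully Funded', 'Expired'],
})
df['Project Category'] = df['Project Category'].factorize()[0]
df['Project Outcome'] = df['Project Outcome'].factorize()[0]

# Vectorize the text column; keep the result as a sparse matrix
tfidf = TfidfVectorizer()
text_features = tfidf.fit_transform(df['Project Description'])

# Stack the other (non-negative) numeric columns next to the TF-IDF matrix
other = csr_matrix(df[['Project Cost', 'Project Category']].values)
X = hstack([other, text_features])
y = df['Project Outcome']

model = MultinomialNB()
model.fit(X, y)
print(model.predict(X).shape)  # one prediction per row
```

This avoids squeezing a whole matrix into a single DataFrame cell, which is what produced the NaN column in the question.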
Also, you should not simply convert categories to numbers. For example, if you convert A, B and C to 0, 1 and 2, they are treated as ordered (2 > 1 > 0, hence C > B > A), which is usually not the case, since A is just different from B and C. For this you can use one-hot encoding (in Pandas, 'get_dummies'). You can use the code below for all your categorical features.
# df holds all the non-categorical features
featurelist_categorical = ['Project Category', 'Feature A', 'Feature B']

for i, j in zip(featurelist_categorical, ['Project Category', 'A', 'B']):
    df = pd.concat([df, pd.get_dummies(df[i], prefix=j)], axis=1)
The feature prefix is not necessary, but it will help you, especially when you have multiple categorical features.
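To see what get_dummies produces, here is a small sketch with made-up values: each category becomes its own indicator column, named prefix_value:

```python
import pandas as pd

df = pd.DataFrame({'Project Category': ['ABC', 'DEF', 'ABC']})
dummies = pd.get_dummies(df['Project Category'], prefix='Project Category')
print(list(dummies.columns))
# ['Project Category_ABC', 'Project Category_DEF']
```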
Also, if for some reason you don't want to convert your features to numbers, you can use H2O.ai. With H2O you can feed categorical variables into models directly.