
Python scikit-learn classification with mixed data types (text, numerical, categorical)

I am trying to perform classification in Python using Pandas and scikit-learn. My dataset contains a mix of text variables, numerical variables and categorical variables.

Let's say my dataset looks like this:

Project Cost        Project Category        Project Description       Project Outcome
12392.2             ABC                     This is a description     Fully Funded
493992.4            DEF                     Stack Overflow rocks      Expired

And I need to predict the variable Project Outcome. Here is what I did (assuming df contains my dataset):

  1. I converted the categories Project Category and Project Outcome to numeric values

    df['Project Category'] = df['Project Category'].factorize()[0]
    df['Project Outcome'] = df['Project Outcome'].factorize()[0]
    

Dataset now looks like this:

Project Cost        Project Category        Project Description       Project Outcome
12392.2             0                       This is a description     0
493992.4            1                       Stack Overflow rocks      1

  2. Then I processed the text column using TF-IDF

    tfidf_vectorizer = TfidfVectorizer()
    df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])
    

Dataset now looks something like this:

Project Cost        Project Category        Project Description       Project Outcome
12392.2             0                       (0, 249)\t0.17070240732941433\n (0, 304)\t0..     0
493992.4            1                       (0, 249)\t0.17070240732941433\n (0, 304)\t0..     1

  3. Since all variables were now numerical, I assumed I could start training my model

    X = df.drop(columns=['Project Outcome'], axis=1)
    y = df['Project Outcome']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    model = MultinomialNB()
    model.fit(X_train, y_train)
    

But model.fit raises the error ValueError: setting an array element with a sequence. When I print X_train, I see that Project Description has been replaced by NaN for some reason.

Any help on this? Is there a good way to do classification using variables with various data types? Thank you.

vdvaxel asked Oct 17 '22 11:10


1 Answer

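TfidfVectorizer.fit_transform returns a sparse matrix with one row per document and one column per vocabulary term, so it cannot be stored in a single DataFrame column. Instead of

df['Project Description'] = tfidf_vectorizer.fit_transform(df['Project Description'])

convert the result to a dense array with .toarray() and add it to the DataFrame as separate columns (get_feature_names_out requires scikit-learn 1.0 or later):

tfidf = tfidf_vectorizer.fit_transform(df['Project Description']).toarray()
tfidf_df = pd.DataFrame(tfidf, columns=tfidf_vectorizer.get_feature_names_out(), index=df.index)
df = pd.concat([df.drop(columns=['Project Description']), tfidf_df], axis=1)

.todense() also gives a dense result, but as a numpy.matrix, so .toarray() is usually the better choice.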
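If the vocabulary is large, a dense conversion can use a lot of memory. As a rough sketch of an alternative (not part of the original answer), the TF-IDF output can stay sparse and be stacked next to the numeric columns with scipy.sparse.hstack. The two-row DataFrame below is toy data modelled on the question's table, and the variable names text_features and numeric_features are chosen here for illustration.

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data modelled on the question's table (illustrative only).
df = pd.DataFrame({
    "Project Cost": [12392.2, 493992.4],
    "Project Description": ["This is a description", "Stack Overflow rocks"],
})

tfidf_vectorizer = TfidfVectorizer()
text_features = tfidf_vectorizer.fit_transform(df["Project Description"])

# The result is sparse: one row per document, one column per vocabulary term.
print(sparse.issparse(text_features), text_features.shape)

# Wrap the numeric column(s) as a sparse matrix and stack them
# side by side with the TF-IDF columns, without densifying anything.
numeric_features = sparse.csr_matrix(df[["Project Cost"]].to_numpy())
X = sparse.hstack([numeric_features, text_features]).tocsr()
print(X.shape)
```

The resulting X can be passed to estimators that accept sparse input, which includes MultinomialNB.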

Also, you should not simply convert categorical features to integers. If you map A, B and C to 0, 1 and 2, most models treat the codes as ordered (2 > 1 > 0, hence C > B > A), which is usually wrong: A is merely different from B and C. Use one-hot encoding for these features instead (in pandas, get_dummies does this). Integer codes are fine for the target, Project Outcome. You can use the code below for all your categorical features.

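import pandas as pd

# Categorical feature columns to one-hot encode, paired with a prefix for the new columns.
# 'Feature A' and 'Feature B' are placeholders for your other categorical columns.
featurelist_categorical = ['Project Category', 'Feature A', 'Feature B']
prefixes = ['Project Category', 'A', 'B']

for col, prefix in zip(featurelist_categorical, prefixes):
    dummies = pd.get_dummies(df[col], prefix=prefix)
    # Replace the original column with its one-hot columns.
    df = pd.concat([df.drop(columns=[col]), dummies], axis=1)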

The prefix is optional, but it helps tell the generated columns apart, especially when there are several categorical features.
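Within scikit-learn itself, the per-column preprocessing described above can also be expressed with ColumnTransformer and Pipeline. This is a different technique from the manual pandas steps, offered here as a minimal sketch rather than the answer's own method. The four-row DataFrame is made-up toy data based on the question's table, and MinMaxScaler is an assumption added because MultinomialNB rejects negative inputs.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data modelled on the question's table; the extra rows are made up.
df = pd.DataFrame({
    "Project Cost": [12392.2, 493992.4, 2500.0, 88000.0],
    "Project Category": ["ABC", "DEF", "ABC", "DEF"],
    "Project Description": [
        "This is a description",
        "Stack Overflow rocks",
        "Another short description",
        "Overflow of rocks",
    ],
    "Project Outcome": ["Fully Funded", "Expired", "Fully Funded", "Expired"],
})

X = df.drop(columns=["Project Outcome"])
y = df["Project Outcome"]  # string labels are accepted as a target

preprocess = ColumnTransformer([
    # Scale cost into [0, 1]; clip=True keeps unseen values in that range.
    ("num", MinMaxScaler(clip=True), ["Project Cost"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Project Category"]),
    # The text column is named as a plain string (not a list) so that
    # TfidfVectorizer receives a 1-D sequence of documents.
    ("txt", TfidfVectorizer(), "Project Description"),
])

model = Pipeline([("preprocess", preprocess), ("clf", MultinomialNB())])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Because the vectorizer and encoders live inside the pipeline, they are fitted only on the data passed to fit, so the same object can be used after train_test_split without refitting the preprocessing on the test rows.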

Also, if for some reason you would rather not expand your features into numeric columns, you can use H2O.ai. With H2O you can feed categorical variables into models directly as text.

amalik2205 answered Oct 20 '22 22:10