
Convert column text data into features using python to use for machine learning

enter image description here

The CSV file on the left has five columns. The application column holds several app types delimited by ;. I want to predict the target from the app, device, and district type, but first I need to convert the file into the dataframe shown on the right so that I can apply machine learning.

How can I do this using Python?

Charith Ellepola asked May 01 '19 17:05




1 Answer

You need to apply multi-hot encoding to the application column and one-hot encoding to the other categorical columns.

Here is my solution!

>>> import pandas as pd
>>> import numpy as np

>>> df = pd.DataFrame({'number': np.random.randint(0,10,size=5),
                  'device': np.random.choice(['a','b'],size=5),
                  'application': ['app2;app3','app1','app2;app4', 'app1;app2', 'app1'],
                  'district': np.random.choice(['aa', 'bb', 'cc'],size=5)})

>>> df

    application device  district    number
0   app2;app3   b         aa    3
1   app1        a         cc    7
2   app2;app4   a         aa    3
3   app1;app2   b         bb    9
4   app1        a         cc    4

from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

# Multi-hot encode the application column
mlb = MultiLabelBinarizer()
# Assuming app names are separated by ;
mhv = mlb.fit_transform(df['application'].apply(lambda x: set(x.split(';'))))
df_out = pd.DataFrame(mhv, columns=mlb.classes_)

# One-hot encode the remaining categorical columns.
# scikit-learn >= 1.2 calls this parameter sparse_output; older releases use sparse=False
enc = OneHotEncoder(sparse_output=False)
ohe_vars = ['device', 'district']  # specify the list of columns here
ohv = enc.fit_transform(df.loc[:, ohe_vars])
ohe_col_names = ['%s_%s' % (var, cat) for var, cats in zip(ohe_vars, enc.categories_) for cat in cats]

# assign() returns a new DataFrame, so the result has to be stored
df_out = df_out.assign(**dict(zip(ohe_col_names, ohv.T)))

df_out

enter image description here
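
For comparison, the same multi-hot plus one-hot layout can probably be produced with pandas alone. The following is only a sketch: the sample values are made up to mirror the columns described in the question, and carrying the number column over unchanged is an assumption about what the final feature table should contain.

```python
import pandas as pd

# Hypothetical sample data mirroring the columns described in the question
df = pd.DataFrame({
    'number': [3, 7, 3, 9, 4],
    'device': ['b', 'a', 'a', 'b', 'a'],
    'application': ['app2;app3', 'app1', 'app2;app4', 'app1;app2', 'app1'],
    'district': ['aa', 'cc', 'aa', 'bb', 'cc'],
})

# Multi-hot: one 0/1 column per app name found in the ';'-delimited strings
apps = df['application'].str.get_dummies(sep=';')

# One-hot: one indicator column per category, prefixed with the source column name
cats = pd.get_dummies(df[['device', 'district']], dtype=int)

# Join on the shared index; 'number' is carried over unchanged (an assumption)
features = pd.concat([df[['number']], apps, cats], axis=1)
print(features)
```

This avoids building column names by hand, but unlike the fitted scikit-learn encoders it does not remember the categories, so new data containing unseen apps or districts would yield a different set of columns.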

Venkatachalam answered Sep 30 '22 09:09
