The CSV file on the left has five columns. The application column holds several app types delimited with ;. Depending on the app, device and district type, I want to predict the target. But first I want to convert the file into the dataframe on the right so that I can apply machine learning.
How can I do this using Python?
Step 1: Load the text.
Step 2: Split the text into tokens, which could be words, sentences or even paragraphs.
Step 3: Convert all the words to lowercase, because a computer treats "Man" and "man" as different strings.
Step 4: Remove the punctuation from each token.
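The four steps above can be sketched with the standard library alone. This is a minimal illustration, assuming word-level tokens split on whitespace; the sample sentence is made up, and the text is defined inline instead of being loaded from a file.

```python
import string

# Step 1: load the text (a made-up sample string stands in for a file here)
text = "The Man saw a man. Then the man waved!"

# Step 2: split the text into tokens (here: whitespace-separated words)
tokens = text.split()

# Step 3: lowercase so that "Man" and "man" become the same token
tokens = [t.lower() for t in tokens]

# Step 4: strip punctuation characters from each token
table = str.maketrans('', '', string.punctuation)
tokens = [t.translate(table) for t in tokens]

# Drop any token that consisted of punctuation only
tokens = [t for t in tokens if t]

print(tokens)
```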
The techniques used to turn text into features can be referred to as “Text Vectorization” techniques, since they all aim at one purpose: turning text into vectors (or arrays, if you want it simpler; or tensors, if you want it more complex) that can then be fed to machine learning models in a classical way.
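One of the simplest vectorization schemes is a bag-of-words count vector. The sketch below is only an illustration in plain Python, assuming whitespace tokenization; the sample documents and the helper name `vectorize` are hypothetical.

```python
from collections import Counter

# Made-up sample documents
docs = ["app one is fast", "app two is slow", "fast app"]

# Tokenize and build a sorted vocabulary shared by all documents
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def vectorize(tokens):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

# Each document becomes a fixed-length list of word counts
vectors = [vectorize(doc) for doc in tokenized]

print(vocab)
for vec in vectors:
    print(vec)
```

Every document maps to a vector of the same length, which is what lets a classical model consume the result.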
Another useful data preprocessing technique is normalization. It rescales each row of data to unit length (an L2 norm of 1 by default). It is mainly useful for sparse datasets that contain lots of zeros. We can rescale the data with the Normalizer class of the scikit-learn Python library.
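A short sketch of row-wise normalization with scikit-learn's `Normalizer`, assuming the default L2 norm; the input matrix is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up data: each row is one sample, with several zero entries
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 5.0, 12.0]])

# Normalizer works row by row; norm='l2' is the default
normalizer = Normalizer(norm='l2')
X_norm = normalizer.fit_transform(X)

print(X_norm)
# Every row now has Euclidean length 1
print(np.linalg.norm(X_norm, axis=1))
```

Note that this scales rows (samples), not columns (features), so it is a different operation from standardizing each feature.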
You need to apply multi-hot encoding to the application column and one-hot encoding to the other categorical columns (device and district).
Here is my solution!
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'number': np.random.randint(0, 10, size=5),
...                    'device': np.random.choice(['a', 'b'], size=5),
...                    'application': ['app2;app3', 'app1', 'app2;app4', 'app1;app2', 'app1'],
...                    'district': np.random.choice(['aa', 'bb', 'cc'], size=5)})
>>> df
  application device district  number
0   app2;app3      b       aa       3
1        app1      a       cc       7
2   app2;app4      a       aa       3
3   app1;app2      b       bb       9
4        app1      a       cc       4
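from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

# Multi-hot encode 'application', assuming app names are separated by ';'
mlb = MultiLabelBinarizer()
mhv = mlb.fit_transform(df['application'].apply(lambda x: set(x.split(';'))))
df_out = pd.DataFrame(mhv, columns=mlb.classes_, index=df.index)

# One-hot encode the remaining categorical columns
ohe_vars = ['device', 'district']  # specify the list of columns here
enc = OneHotEncoder()
# .toarray() densifies the default sparse output; this avoids the `sparse`
# argument, which newer scikit-learn releases renamed to `sparse_output`
ohv = enc.fit_transform(df.loc[:, ohe_vars]).toarray()
ohe_col_names = ['%s_%s' % (var, cat) for var, cats in zip(ohe_vars, enc.categories_) for cat in cats]

# assign() returns a new DataFrame, so the result must be stored
df_out = df_out.assign(**dict(zip(ohe_col_names, ohv.T)))
df_out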
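If pulling in scikit-learn is not required, the same encoding can probably be done with pandas alone. This is an alternative sketch, not the method used in the answer above; it assumes the apps are separated by `;` and uses fixed sample values in place of the random ones so the result is reproducible.

```python
import pandas as pd

# Fixed sample data mirroring the example frame above
df = pd.DataFrame({
    'number': [3, 7, 3, 9, 4],
    'device': ['b', 'a', 'a', 'b', 'a'],
    'application': ['app2;app3', 'app1', 'app2;app4', 'app1;app2', 'app1'],
    'district': ['aa', 'cc', 'aa', 'bb', 'cc'],
})

# Multi-hot: one 0/1 column per app name found in the ';'-delimited strings
apps = df['application'].str.get_dummies(sep=';')

# One-hot: one column per category, prefixed with the source column name
cats = pd.get_dummies(df[['device', 'district']], dtype=int)

# Keep the numeric column and join everything on the shared index
encoded = pd.concat([df[['number']], apps, cats], axis=1)
print(encoded)
```

The scikit-learn encoders remain the better fit when the same fitted encoding has to be reapplied to unseen data, since they remember the categories seen during `fit`.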