Convert column text data into features using python to use for machine learning

[Image: source CSV on the left, target dataframe on the right]

The CSV file on the left has five columns. The application column holds several app types delimited by ;. I want to predict the target from the app, device, and district type, but first I need to convert the file into the dataframe shown on the right so that I can apply machine learning.

How can I do this using Python?

asked May 01 '19 17:05

Charith Ellepola

1 Answer

You need to apply multi-hot encoding to the application column, because one row can contain several apps, and one-hot encoding to the other categorical columns.
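To make the term concrete, here is a minimal pure-Python sketch of what multi-hot encoding produces. The three sample strings are hypothetical values in the same format as the application column, and the names `rows`, `vocab`, and `multi_hot` are only for illustration.

```python
# Hypothetical sample values mirroring the "application" column in the question
rows = ["app2;app3", "app1", "app2;app4"]

# Collect every distinct app name; each one becomes a feature column
vocab = sorted({app for row in rows for app in row.split(";")})

# One 0/1 flag per app name; several flags can be 1 in the same row,
# which is what distinguishes multi-hot from one-hot encoding
multi_hot = [[int(app in row.split(";")) for app in vocab] for row in rows]

print(vocab)
for row, vec in zip(rows, multi_hot):
    print(row, vec)
```

A one-hot encoded column such as device would instead have exactly one flag set per row.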

Here is my solution!

import pandas as pd
import numpy as np

# Sample data: number, device, and district are random, so your values will differ
df = pd.DataFrame({'number': np.random.randint(0, 10, size=5),
                   'device': np.random.choice(['a', 'b'], size=5),
                   'application': ['app2;app3', 'app1', 'app2;app4', 'app1;app2', 'app1'],
                   'district': np.random.choice(['aa', 'bb', 'cc'], size=5)})

df

  application device district  number
0   app2;app3      b       aa       3
1        app1      a       cc       7
2   app2;app4      a       aa       3
3   app1;app2      b       bb       9
4        app1      a       cc       4

from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer

# Multi-hot encode the application column (app names are assumed to be separated by ;)
mlb = MultiLabelBinarizer()
mhv = mlb.fit_transform(df['application'].apply(lambda x: set(x.split(';'))))
df_out = pd.DataFrame(mhv, columns=mlb.classes_)

# One-hot encode the remaining categorical columns
# (scikit-learn older than 1.2 uses sparse=False instead of sparse_output=False)
enc = OneHotEncoder(sparse_output=False)
ohe_vars = ['device', 'district']  # specify the list of columns here
ohv = enc.fit_transform(df.loc[:, ohe_vars])
ohe_col_names = ['%s_%s' % (var, cat) for var, cats in zip(ohe_vars, enc.categories_) for cat in cats]

# assign() returns a new DataFrame, so the result has to be stored
df_out = df_out.assign(**dict(zip(ohe_col_names, ohv.T)))

df_out
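For comparison, the same encoding can be sketched with pandas alone. This is an alternative, not the approach used in the answer above; it relies on `Series.str.get_dummies` and `pd.get_dummies`. The sample values are fixed, hypothetical data in the shape of the question's CSV, and keeping the number column in the result is an assumption about what the target dataframe should contain.

```python
import pandas as pd

# Fixed sample data (hypothetical values in the same shape as the question's CSV)
df = pd.DataFrame({
    "number": [3, 7, 3, 9, 4],
    "device": ["b", "a", "a", "b", "a"],
    "application": ["app2;app3", "app1", "app2;app4", "app1;app2", "app1"],
    "district": ["aa", "cc", "aa", "bb", "cc"],
})

# Multi-hot: str.get_dummies splits on the separator and emits one 0/1 column per app
apps = df["application"].str.get_dummies(sep=";")

# One-hot: pd.get_dummies prefixes each new column with the source column name
cats = pd.get_dummies(df[["device", "district"]], dtype=int)

# Keep the numeric column and join everything side by side on the shared index
features = pd.concat([df[["number"]], apps, cats], axis=1)
print(features)
```

The scikit-learn encoders are still the better fit when the same transformation has to be applied to new data later, because a fitted encoder remembers the categories it saw during training.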


answered Sep 30 '22 09:09

Venkatachalam

Tags: python, csv, multiple-columns, machine-learning, scikit-learn
