My data in ddata.csv is as follows:
col1,col2,col3,col4
A,10,a;b;c, 20
B,30,d;a;b,40
C,50,g;h;a,60
I want to separate col3 into multiple columns, but based on their values. In other wants, I would like my final data to look like
col1, col2, name_a, name_b, name_c, name_d, name_g, name_h, col4
A, 10, a, b, c, NULL, NULL, NULL, 20
B, 30, a, b, NULL, d, NULL, NULL, 40
C, 50, a, NULL, NULL, NULL, g, h, 60
My code, at the moment taken reference from this answer, is incomplete:
import pandas as pd
import string
L = list(string.ascii_lowercase)
names = dict(zip(range(len(L)), ['name_' + x for x in L]))
df = pd.read_csv('ddata.csv')
df2 = df['col3'].str.split(';', expand=True).rename(columns=names)
Column names 'a','b','c' ... are taken at random, and has no relevance to the actual data a,b,c.
Right now, my code can just split 'col3' into three columns as follows:
name_a name_b name_c
a b c
d e f
g h i
But, it should be like name_a, name_b, name_c, name_d, name_g, name_h a, b, c, NULL, NULL, NULL a, b, NULL, d, NULL, NULL a, NULL, NULL, NULL, g, h
and in the end, I need to just replace col3 with these multiple columns.
Split column by delimiter into multiple columnsApply the pandas series str. split() function on the “Address” column and pass the delimiter (comma in this case) on which you want to split the column. Also, make sure to pass True to the expand parameter.
split() function is used to break up single column values into multiple columns based on a specified separator or delimiter. The Series. str. split() function is similar to the Python string split() method, but split() method works on the all Dataframe columns, whereas the Series.
split() Pandas provide a method to split string around a passed separator/delimiter. After that, the string can be stored as a list in a series or it can also be used to create multiple column data frames from a single separated string.
The pat parameter can be used to split by other characters. When using expand=True , the split elements will expand out into separate columns. If NaN is present, it is propagated throughout the columns during the split.
Use Series.str.get_dummies
:
print (df['col3'].str.get_dummies(';'))
a b c d g h
0 1 1 1 0 0 0
1 1 1 0 1 0 0
2 1 0 0 0 1 1
For extract column col3
from original use DataFrame.pop
, create new DataFrame
by multiple values by columns names in numpy, replace NaN
s instead empty strings with DataFrame.where
and DataFrame.add_prefix
for new columns names.
pos = df.columns.get_loc('col3')
df2 = df.pop('col3').str.get_dummies(';').astype(bool)
df2 = (pd.DataFrame(df2.values * df2.columns.values[ None, :],
columns=df2.columns,
index=df2.index)
.where(df2)
.add_prefix('name_'))
Last join all DataFrames filtered by positions with iloc
join together by concat
:
df = pd.concat([df.iloc[:, :pos], df2, df.iloc[:, pos:]], axis=1)
print (df)
col1 col2 name_a name_b name_c name_d name_g name_h col4
0 A 10 a b c NaN NaN NaN 20
1 B 30 a b NaN d NaN NaN 40
2 C 50 a NaN NaN NaN g h 60
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With