Pandas: Separate column containing semicolon into multiple columns based on the values

Tags: python, pandas, dataframe, csv

My data in ddata.csv is as follows:

col1,col2,col3,col4
A,10,a;b;c,20
B,30,d;a;b,40
C,50,g;h;a,60

I want to separate col3 into multiple columns, but based on the values it contains. In other words, I would like my final data to look like this:

col1, col2, name_a, name_b, name_c, name_d, name_g, name_h, col4
A,    10,   a,      b,      c,      NULL,   NULL,   NULL,   20
B,    30,   a,      b,      NULL,   d,      NULL,   NULL,   40
C,    50,   a,      NULL,   NULL,   NULL,   g,      h,      60

My code so far, adapted from this answer, is incomplete:

import pandas as pd

import string
L = list(string.ascii_lowercase)

names = dict(zip(range(len(L)), ['name_' + x for x in  L]))
df = pd.read_csv('ddata.csv')
df2 = df['col3'].str.split(';', expand=True).rename(columns=names)

The column-name suffixes 'a', 'b', 'c', ... are arbitrary and have no relation to the actual values a, b, c in the data.

Right now, my code can just split 'col3' into three columns as follows:

name_a name_b name_c
a      b      c
d      a      b
g      h      a

But it should look like this:

name_a, name_b, name_c, name_d, name_g, name_h
a,      b,      c,      NULL,   NULL,   NULL
a,      b,      NULL,   d,      NULL,   NULL
a,      NULL,   NULL,   NULL,   g,      h

In the end, I just need to replace col3 with these new columns.

asked May 13 '19 06:05

kingmakerking

1 Answer

Use Series.str.get_dummies:

print (df['col3'].str.get_dummies(';'))
   a  b  c  d  g  h
0  1  1  1  0  0  0
1  1  1  0  1  0  0
2  1  0  0  0  1  1
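
To try this without the CSV file, here is a minimal sketch that inlines the sample data from the question. The variable names csv_text and dummies are only illustrative. It relies on get_dummies returning one indicator column per distinct value, in sorted order.

```python
import io

import pandas as pd

# Sample data from the question, inlined so the sketch does not need ddata.csv.
csv_text = """col1,col2,col3,col4
A,10,a;b;c,20
B,30,d;a;b,40
C,50,g;h;a,60
"""

df = pd.read_csv(io.StringIO(csv_text))

# One indicator column per distinct value found in col3.
dummies = df['col3'].str.get_dummies(sep=';')

print(dummies.columns.tolist())
print(dummies)
```

Note that the order of the values inside each cell (d;a;b versus a;b;d) does not matter here, only which values are present.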

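To extract col3 from the original DataFrame, use DataFrame.pop. Then build a new DataFrame by multiplying the boolean indicator values by the column names in NumPy, which gives the column name where the indicator is True and an empty string otherwise. Replace the empty strings with NaN using DataFrame.where, and add the name_ prefix to the new column names with DataFrame.add_prefix.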

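# Remember where col3 sits so the new columns can be inserted in its place.
pos = df.columns.get_loc('col3')

# pop removes col3 from df; get_dummies gives one boolean column per value.
df2 = df.pop('col3').str.get_dummies(';').astype(bool)
# Multiplying a boolean by a column name yields the name (True) or '' (False);
# where(df2) then turns the '' cells into NaN.
df2 = (pd.DataFrame(df2.values * df2.columns.values[None, :],
                    columns=df2.columns,
                    index=df2.index)
         .where(df2)
         .add_prefix('name_'))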
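
If multiplying booleans by strings feels too implicit, the same step can probably be written with numpy.where instead. This is a swapped-in alternative, not the approach used above, and the names indicators, mask, labels, values and named are only illustrative. The indicator matrix is typed in by hand so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

# Indicator matrix as produced by str.get_dummies(';') on the sample data.
indicators = pd.DataFrame(
    [[1, 1, 1, 0, 0, 0],
     [1, 1, 0, 1, 0, 0],
     [1, 0, 0, 0, 1, 1]],
    columns=list('abcdgh'),
)

mask = indicators.to_numpy(dtype=bool)
labels = indicators.columns.to_numpy(dtype=object)

# Broadcast the row of column names across every data row, keeping a name
# only where the indicator is set and NaN everywhere else.
values = np.where(mask, labels[None, :], np.nan)

named = pd.DataFrame(values,
                     index=indicators.index,
                     columns=indicators.columns).add_prefix('name_')
print(named)
```

Here the NaN is written directly by np.where, so no separate DataFrame.where call is needed.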

Finally, split the original DataFrame at the saved position with iloc and join the pieces back together with concat:

df = pd.concat([df.iloc[:, :pos], df2, df.iloc[:, pos:]], axis=1)
print (df)
  col1  col2 name_a name_b name_c name_d name_g name_h  col4
0    A    10      a      b      c    NaN    NaN    NaN    20
1    B    30      a      b    NaN      d    NaN    NaN    40
2    C    50      a    NaN    NaN    NaN      g      h    60
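
The steps above could be wrapped in a small helper, roughly as sketched below. The function name split_values_to_columns and its parameters sep and prefix are hypothetical, not part of pandas. Two deliberate differences from the code above: it uses drop instead of pop so the input DataFrame is left unchanged, and it builds the labels from a DataFrame of column names rather than NumPy multiplication.

```python
import io

import pandas as pd


def split_values_to_columns(df, column, sep=';', prefix='name_'):
    """Return a copy of df with `column` replaced by one column per distinct value."""
    pos = df.columns.get_loc(column)
    mask = df[column].str.get_dummies(sep=sep).astype(bool)

    # Every cell of `labels` holds its own column name; where(mask) keeps the
    # name only where the value was present and leaves NaN elsewhere.
    labels = pd.DataFrame({name: name for name in mask.columns}, index=mask.index)
    named = labels.where(mask).add_prefix(prefix)

    # Drop the original column, then splice the new columns in at its position.
    rest = df.drop(columns=column)
    return pd.concat([rest.iloc[:, :pos], named, rest.iloc[:, pos:]], axis=1)


csv_text = """col1,col2,col3,col4
A,10,a;b;c,20
B,30,d;a;b,40
C,50,g;h;a,60
"""
df = pd.read_csv(io.StringIO(csv_text))
result = split_values_to_columns(df, 'col3')
print(result)
```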

answered Sep 27 '22 23:09

jezrael
