pandas: Split comma-separated values in a DataFrame column (one Series) into multiple columns. Elegant solutions?

I have a column in a DataFrame (read from a csv) which contains comma-separated values. I'd like to split this column into multiple columns.

The problem is an old one, and has been discussed here before, but there is one peculiarity: an entry may contain anywhere from 0 to n comma-separated values. An example:

df.head():

i: vals   | sth_else 
---------------------
1: a,b,c  | ba
2: a,d    | be
3:        | bi
4: e,a,c  | bo
5: e      | bu

I'd like the following output (or similar, e.g. True/False):

i : a | b | c | d | e |  sth_else 
-----------------------------------
1:  1 | 1 | 1 | 0 | 0 | ba
2:  1 | 0 | 0 | 1 | 0 | be
3:  0 | 0 | 0 | 0 | 0 | bi
4:  1 | 0 | 1 | 0 | 1 | bo
5:  0 | 0 | 0 | 0 | 1 | bu

I'm currently experimenting with the Series.str.split and then Series.to_dict functions, but without any satisfactory results (it always raises a ValueError: arrays must all be same length).
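
Roughly what I am trying, as a minimal sketch (the names here are only illustrative; the real code reads the csv below):

import pandas as pd

vals = pd.Series(['a,b,c', 'a,d', None, 'e,a,c', 'e'], name='vals')
parts = vals.str.split(',')        # each entry becomes a list of values (or NaN for the empty row)
# pd.DataFrame(parts.to_dict())    # something along these lines raises the ValueError about unequal lengths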

Also, I always try to find elegant solutions which are still easily understandable when looked at again after a couple of months ;). In any case, suggestions are highly appreciated!

Here is the dummy.csv for testing.

vals;sth_else 
a,b,c;ba
a,d;be
;bi
e,a,c;bo
e;bu
asked by dmeu

2 Answers

import pandas as pd
from StringIO import StringIO   # Python 2.7 used here
# from io import StringIO      # use this instead on Python 3.x

# data
# ==================================================================
csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'

df = pd.read_csv(StringIO(csv_buffer), sep=';')

Out[58]: 
    vals sth_else
0  a,b,c       ba
1    a,d       be
2    NaN       bi
3  e,a,c       bo
4      e       bu

# processing
# ==================================================================
def func(group):
    # group is a single row (grouped on index level 0); split its
    # comma-separated string into one Series entry per value
    return pd.Series(group.vals.str.split(',').values[0], name='vals')

ser = df.groupby(level=0).apply(func)

Out[60]: 
0  0      a
   1      b
   2      c
1  0      a
   1      d
2  0    NaN
3  0      e
   1      a
   2      c
4  0      e
Name: vals, dtype: object


# use get_dummies, then aggregate each of the columns a b c d e to its max
# (1 if the value occurs at least once in that row, 0 otherwise)
pd.get_dummies(ser)

Out[85]: 
     a  b  c  d  e
0 0  1  0  0  0  0
  1  0  1  0  0  0
  2  0  0  1  0  0
1 0  1  0  0  0  0
  1  0  0  0  1  0
2 0  0  0  0  0  0
3 0  0  0  0  0  1
  1  1  0  0  0  0
  2  0  0  1  0  0
4 0  0  0  0  0  1

# do this groupby on outer index level [0,1,2,3,4] and reduce any inner group from multiple rows to one row
df_dummies = pd.get_dummies(ser).groupby(level=0).apply(lambda group: group.max())

Out[64]: 
   a  b  c  d  e
0  1  1  1  0  0
1  1  0  0  1  0
2  0  0  0  0  0
3  1  0  1  0  1
4  0  0  0  0  1
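
The same reduction can also be written as a plain .max() on the groupby, which gives the identical result:

df_dummies = pd.get_dummies(ser).groupby(level=0).max()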


df_dummies['sth_else'] = df.sth_else

Out[67]: 
   a  b  c  d  e sth_else
0  1  1  1  0  0       ba
1  1  0  0  1  0       be
2  0  0  0  0  0       bi
3  1  0  1  0  1       bo
4  0  0  0  0  1       bu
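
For reference, the split and the dummy encoding can also be collapsed into one step with Series.str.get_dummies (a shorter sketch; it assumes vals was read with read_csv as above, so empty entries are NaN):

df_dummies = df['vals'].str.get_dummies(sep=',')   # one indicator column per distinct value
df_dummies['sth_else'] = df['sth_else']            # re-attach the remaining column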
answered by Jianxun Li


This is very similar to another question today. As I said in that question, there may be a simple elegant pandas way to do this, but I also find it convenient to simply create a new data frame and populate it by iterating over the original one in the following fashion:

#import and create your data
import pandas as pd
DF = pd.DataFrame({ 'vals'  : ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                    'other' : ['ba', 'be', 'bi', 'bo', 'bu'] 
                  }, dtype = str)

Now create a new data frame with the other column from the DF as the index, and columns drawn from the unique characters found in the vals column of the DF:

New_DF = pd.DataFrame({col : 0 for col in 
                             set([letter for letter in ''.join([char for char in DF.vals.values]) 
                             if letter.isalpha()])},
                             index = DF.other)

In [51]: New_DF
Out[51]: 
       a  b  c  d  e
other               
ba     0  0  0  0  0
be     0  0  0  0  0
bi     0  0  0  0  0
bo     0  0  0  0  0
bu     0  0  0  0  0
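
As an aside, the nested comprehension that collects the column names can be shortened, since it only has to pick out the distinct letters (this sketch assumes, as here, that every value is a single character):

cols = set(''.join(DF.vals)) - {','}   # unique characters with the comma stripped out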

Now simply iterate over the index of New_DF, slice the original DF at that value, and iterate over the columns to see whether each one appears in the relevant string:

for ind in New_DF.index:
    relevant_string = str(DF[DF.other == ind].vals.values)
    for col in list(New_DF.columns):
        if col in relevant_string:
            New_DF.loc[ind, col] += 1

The output looks like this:

In [54]: New_DF
Out[54]: 
       a  b  c  d  e
other               
ba     1  1  1  0  0
be     1  0  0  1  0
bi     0  0  0  0  0
bo     1  0  1  0  1
bu     0  0  0  0  1
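
Note that the col in relevant_string check works here because every value is a single character; with multi-character values it could also match substrings. A token-based variant of the same loop (a sketch on the same data) avoids that:

for ind in New_DF.index:
    # compare whole comma-separated tokens instead of raw substrings
    tokens = set(str(DF.loc[DF.other == ind, 'vals'].values[0]).split(','))
    for col in New_DF.columns:
        if col in tokens:
            New_DF.loc[ind, col] = 1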
answered by Woody Pride


