
How to split multiple columns in Pandas

Tags:

python

pandas

I have a data frame like below:

import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})
Out[248]: 
                var1              var2                var3
0    0,3788,99,20.88   0,929,92,299.90     8,9332,99,29.10
1  3,99022,08,91.995  1,38333,9,993.11  7,922111,07,45.443

I want to split each column on commas and place the new sets of columns next to each other. The resulting data frame should look like this:

df2 = pd.DataFrame({('var1', 'x1'): [0, 3], ('var1', 'x2'): [3788, 99022], ('var1', 'x3'): [99, '08'], ('var1', 'x4'): [20.88, 91.995],
                    ('var2', 'x1'): [0, 1], ('var2', 'x2'): [929, 38333], ('var2', 'x3'): [92, 9], ('var2', 'x4'): [299.90, 993.11],
                    ('var3', 'x1'): [8, 7], ('var3', 'x2'): [9332, 922111], ('var3', 'x3'): [99, '07'], ('var3', 'x4'): [29.10, 45.443]})

Out[249]: 
  var1                    var2                    var3                    
    x1     x2  x3      x4   x1     x2  x3      x4   x1      x2  x3      x4
0    0   3788  99  20.880    0    929  92  299.90    8    9332  99  29.100
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443

The MultiIndex is not mandatory, but I'd like to be able to easily gather the data into df3 if needed:

    var  x1      x2  x3       x4
0  var1   0    3788  99   20.880
1  var1   3   99022  08   91.995
0  var2   0     929  92  299.900
1  var2   1   38333   9  993.110
0  var3   8    9332  99   29.100
1  var3   7  922111  07   45.443

My effort included pd.melt and str.split:

df_long = pd.melt(df.reset_index(drop = False), id_vars = 'index', var_name = 'var', value_name = 'values') \
    .sort_values(['index', 'var']) \
    .set_index('index')
df_long = df_long['values'].str.split(',', expand = True)
df_long.columns = ['x' + str(i) for i in range(df_long.shape[1])]

But two problems remain: 1) I don't know how to then spread the data for var1, var2, var3, ... next to each other; 2) transforming from wide format to long format (df to df_long) and back again (df_long to df3) seems highly inefficient, and performance matters for the solution I'm looking for.

So what's the best way to transform from df to df2, so that we could then easily obtain df3 if needed?

jakes asked Mar 20 '26 15:03


2 Answers

You can use stack(), str.split() with expand=True, and unstack() to achieve this:

final=(df.stack().str.split(',',expand=True).unstack().swaplevel(axis=1)
                                             .sort_index(level=0,axis=1))
print(final)

     var1                    var2                    var3                    
     0      1   2       3    0      1   2       3    0       1   2       3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443

To rename the inner level of the columns (level 1), use:

final.columns=pd.MultiIndex.from_tuples([(a,f'x{b}') for a,b in final.columns])

   var1                    var2                    var3                       
    x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0    0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1    3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
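The labels above start at x0, while the question asks for x1 to x4. A minimal sketch of one way to get 1-based labels, assuming it is acceptable to rename the split pieces before unstacking (the name parts is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# Stack to one long Series and split every cell in a single pass.
parts = df.stack().str.split(',', expand=True)

# The split pieces are labelled 0..3; relabel them x1..x4 (1-based).
parts.columns = [f'x{i + 1}' for i in parts.columns]

# Spread the variable names back into the columns and put them on the outer level.
df2 = parts.unstack().swaplevel(axis=1).sort_index(axis=1)
print(df2)
```

The values are still strings at this point, which is why '08' keeps its leading zero.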

You can also use the following for the second output shown in your question (note that the rows come out ordered by original row rather than grouped by var):

df.stack().str.split(',',expand=True).add_prefix('x').reset_index(1).reset_index(drop=True)

  level_1 x0      x1  x2      x3
0    var1  0    3788  99   20.88
1    var2  0     929  92  299.90
2    var3  8    9332  99   29.10
3    var1  3   99022  08  91.995
4    var2  1   38333   9  993.11
5    var3  7  922111  07  45.443
anky answered Mar 23 '26 04:03


Here is an approach that gets df3 first:

df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix("x").assign(var=s) for s in df])

print(df3)
  x0      x1  x2      x3   var
0  0    3788  99   20.88  var1
1  3   99022  08  91.995  var1
0  0     929  92  299.90  var2
1  1   38333   9  993.11  var2
0  8    9332  99   29.10  var3
1  7  922111  07  45.443  var3
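Note that str.split returns strings, whereas the df2 in the question holds numbers. A hedged sketch of converting afterwards, assuming x0 and x1 should be integers, x3 a float, and x2 left as text so that values like '08' keep their leading zero (that dtype choice is an assumption, not something stated in the question):

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# Long frame as built above: one block of rows per original column.
df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix('x').assign(var=s)
                 for s in df])

# Assumed dtypes: integers for x0 and x1, float for x3; x2 is left untouched.
df3 = df3.astype({'x0': 'int64', 'x1': 'int64', 'x3': 'float64'})
print(df3.dtypes)
```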

And then:

df2 = df3.set_index("var", append=True).unstack().swaplevel(axis=1).sort_index(axis=1)

print(df2)
var var1                    var2                    var3                    
      x0     x1  x2      x3   x0     x1  x2      x3   x0      x1  x2      x3
0      0   3788  99   20.88    0    929  92  299.90    8    9332  99   29.10
1      3  99022  08  91.995    1  38333   9  993.11    7  922111  07  45.443
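Going the other way, from the wide df2 back to the long df3, can probably be done by stacking the outer column level. A sketch under the assumption that the outer level is named var, as in the output above (here it is named explicitly so the example is self-contained):

```python
import pandas as pd

df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
                   'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
                   'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})

# Wide frame with (var, x) column pairs.
df2 = (df.stack().str.split(',', expand=True).add_prefix('x')
         .unstack().swaplevel(axis=1).sort_index(axis=1))
df2.columns.names = ['var', None]

# Move the 'var' level from the columns into the index, group rows by var,
# then turn 'var' into an ordinary column.
df3 = (df2.stack(level='var')
          .swaplevel()
          .sort_index()
          .reset_index(level='var'))
print(df3)
```

The explicit sort_index is there so the row order does not depend on how a particular pandas version orders the result of stack.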
Horace answered Mar 23 '26 04:03


