I have a data frame like below:
df = pd.DataFrame({'var1': ['0,3788,99,20.88', '3,99022,08,91.995'],
'var2': ['0,929,92,299.90', '1,38333,9,993.11'],
'var3': ['8,9332,99,29.10', '7,922111,07,45.443']})
Out[248]:
var1 var2 var3
0 0,3788,99,20.88 0,929,92,299.90 8,9332,99,29.10
1 3,99022,08,91.995 1,38333,9,993.11 7,922111,07,45.443
I want to split each column on comma and same the new set of columns next to each other. So the resulting data frame should look like below:
df2 = pd.DataFrame({('var1', 'x1'): [0, 3], ('var1', 'x2'): [3788, 99022], ('var1', 'x3'): [99, '08'], ('var1', 'x4'): [20.88, 91.995],
('var2', 'x1'): [0, 1], ('var2', 'x2'): [929, 38333], ('var2', 'x3'): [92, 9], ('var2', 'x4'): [299.90, 993.11],
('var3', 'x1'): [8, 7], ('var3', 'x2'): [9332, 922111], ('var3', 'x3'): [99, '07'], ('var3', 'x4'): [29.10, 45.443]})
Out[249]:
var1 var2 var3
x1 x2 x3 x4 x1 x2 x3 x4 x1 x2 x3 x4
0 0 3788 99 20.880 0 929 92 299.90 8 9332 99 29.100
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
The MultiIndex is not mandatory, but then I'd like to have an opportunity to easily gather the data and obtain df3 if needed:
var x1 x2 x3 x4
0 var1 0 3788 99 20.880
1 var1 3 99022 08 91.995
0 var2 0 929 92 299.900
1 var2 1 38333 9 993.110
0 var3 8 9332 99 29.100
1 var3 7 922111 07 45.443
My effort included pd.melt and str.split:
df_long = pd.melt(df.reset_index(drop = False), id_vars = 'index', var_name = 'var', value_name = 'values') \
.sort_values(['index', 'var']) \
.set_index('index')
df_long = df_long['values'].str.split(',', expand = True)
df_long.columns = ['x' + str(i) for i in range(df_long.shape[1])]
But:
1) I don't know how to then spread the data for different var1, var2, var3... next to each other
2) transforming from wide format to long format (df to df_long) and back again (df_long to df3) seems highly inefficient and I care for performance with the seeking solution.
So what's the best way to transform from df to df2, so that we could then easily obtain df3 if needed?
You can use stack , str.split() with expand=True , unstack() to achieve this:
final=(df.stack().str.split(',',expand=True).unstack().swaplevel(axis=1)
.sort_index(level=0,axis=1))
print(final)
var1 var2 var3
0 1 2 3 0 1 2 3 0 1 2 3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
For renaming the 0th level of the columns, use;
final.columns=pd.MultiIndex.from_tuples([(a,f'x{b}') for a,b in final.columns])
var1 var2 var3
x0 x1 x2 x3 x0 x1 x2 x3 x0 x1 x2 x3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
You can also use the below for the second output shown in your question:
df.stack().str.split(',',expand=True).add_prefix('x').reset_index(1).reset_index(drop=True)
level_1 x0 x1 x2 x3
0 var1 0 3788 99 20.88
1 var2 0 929 92 299.90
2 var3 8 9332 99 29.10
3 var1 3 99022 08 91.995
4 var2 1 38333 9 993.11
5 var3 7 922111 07 45.443
Here is an approach that gets df3 first:
df3 = pd.concat([df[s].str.split(',', expand=True).add_prefix("x").assign(var=s) for s in df])
print(df3)
x0 x1 x2 x3 var
0 0 3788 99 20.88 var1
1 3 99022 08 91.995 var1
0 0 929 92 299.90 var2
1 1 38333 9 993.11 var2
0 8 9332 99 29.10 var3
1 7 922111 07 45.443 var3
And then:
df2 = df3.set_index("var", append=True).unstack().swaplevel(axis=1).sort_index(axis=1)
print(df2)
var var1 var2 var3
x0 x1 x2 x3 x0 x1 x2 x3 x0 x1 x2 x3
0 0 3788 99 20.88 0 929 92 299.90 8 9332 99 29.10
1 3 99022 08 91.995 1 38333 9 993.11 7 922111 07 45.443
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With