Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas- split a multiple index dataframe

I have a multiple indexed pandas data frame where I want to separate values by '||' character and include one more layer of index with three new columns 'Connection', 'Val1' 'Val2'.

Will be helpful if someone could give me few hints to do this.

Current example Data Frame:

                              Experiment1                        Experiment2      
    Target      Analyze1_ab               Analyze2_zz            Analyze1_yy        
    XXX_1   Edge2||3.1E-07||-0.5    Edge2||2.1E-06||-0.9     Edge2||6.4E-02||-0.3   
    XXX_4   Edge1||6.4E-12||1.1     Edge1||2.4E-11||9.4      Edge1||1.4E-11||1.4    
    ABC_1   Edge1||3.9E-07||0.7     Edge1||2.9E-07||5.6      Edge1||6.8E-02||0.4    
    ABC_2   Edge2||1.1E-09||-0.5    Edge2||1.2E-09||1.2      Edge2||1.0E-03||-0.5   
    ABC_3   Edge2||4.6E-25||-0.8    Edge2||2.6E-10||1.9      Edge2||5.0E-17||-0.9   
    XXX_2   Edge2||1.7E-07||-0.5    Edge2||5.7E-08||-0.3     Edge2||4.1E-02||-0.3   
    ABC_4   Edge1||8.1E-02||0.5     Edge1||9.1E-02||1.5      Edge1||5.4E-02||0.6    
    ABC_5   Edge1||6.7E-02||0.3     Edge1||4.2E-02||1.9      Edge1||5.6E-03||0.4    
    XXX_3   Edge2||3.1E-03||-0.4    Edge1||2.4E-11||1.1      Edge2||2.4E-02||-0.3   

Desired Data Frame:

                                   Experiment1                            Experiment2       
    Target          Analyze1_ab                  Analyze2_zz                Analyze1_yy     
           Connection   Val1    Val2    Connection  Val1    Val2    Connection  Val1    Val2
    XXX_1   Edge2   3.10E-07    -0.5    Edge2   2.10E-06    -0.9    Edge2   6.40E-02    -0.3
    XXX_4   Edge1   6.40E-12    1.1     Edge1   2.40E-11    9.4     Edge1   1.40E-11    1.4
    ABC_1   Edge1   3.90E-07    0.7     Edge1   2.90E-07    5.6     Edge1   6.80E-02    0.4
    ABC_2   Edge2   1.10E-09    -0.5    Edge2   1.20E-09    1.2     Edge2   1.00E-03    -0.5
    ABC_3   Edge2   4.60E-25    -0.8    Edge2   2.60E-10    1.9     Edge2   5.00E-17    -0.9
    XXX_2   Edge2   1.70E-07    -0.5    Edge2   5.70E-08    -0.3    Edge2   4.10E-02    -0.3
    ABC_4   Edge1   8.10E-02    0.5     Edge1   9.10E-02    1.5     Edge1   5.40E-02    0.6
    ABC_5   Edge1   6.70E-02    0.3     Edge1   4.20E-02    1.9     Edge1   5.60E-03    0.4
    XXX_3   Edge2   3.10E-03    -0.4    Edge1   2.40E-11    1.1     Edge2   2.40E-02    -0.3
like image 905
Rtut Avatar asked Mar 09 '23 02:03

Rtut


2 Answers

Setup

df
Out[2319]: 
      ID           Experiment1                                 Experiment2
  Target           Analyze1_ab           Analyze2_ab           Analyze1_yy
0  XXX_1  Edge2||3.1E-07||-0.5  Edge2||2.1E-06||-0.9  Edge2||6.4E-02||-0.3
1  XXX_4   Edge1||6.4E-12||1.1   Edge1||2.4E-11||9.4   Edge1||1.4E-11||1.4
2  ABC_1   Edge1||3.9E-07||0.7   Edge1||2.9E-07||5.6   Edge1||6.8E-02||0.4
3  ABC_2  Edge2||1.1E-09||-0.5   Edge2||1.2E-09||1.2  Edge2||1.0E-03||-0.5
4  ABC_3  Edge2||4.6E-25||-0.8   Edge2||2.6E-10||1.9  Edge2||5.0E-17||-0.9
5  XXX_2  Edge2||1.7E-07||-0.5  Edge2||5.7E-08||-0.3  Edge2||4.1E-02||-0.3
6  ABC_4   Edge1||8.1E-02||0.5   Edge1||9.1E-02||1.5   Edge1||5.4E-02||0.6
7  ABC_5   Edge1||6.7E-02||0.3   Edge1||4.2E-02||1.9   Edge1||5.6E-03||0.4
8  XXX_3  Edge2||3.1E-03||-0.4   Edge1||2.4E-11||1.1  Edge2||2.4E-02||-0.3

Solution

#split columns by '||' and rebuild a Dataframe with the separated columns
df2 = pd.DataFrame(np.asarray(df.iloc[:,1:].apply(lambda x: x.str.split('\|\|')).values.tolist()).reshape(9,-1))
#set Multilevel columns
df2.columns=pd.MultiIndex.from_tuples([('Experiment1','Analyze1_ab','Connection'),
('Experiment1','Analyze1_ab','Val1'),
('Experiment1','Analyze1_ab','Val2'),
('Experiment1','Analyze2_zz','Connection'),
('Experiment1','Analyze2_zz','Val1'),
('Experiment1','Analyze2_zz','Val2'),
('Experiment2','Analyze1_yy','Connection'),
('Experiment2','Analyze1_yy','Val1'),
('Experiment2','Analyze1_yy','Val2')])
#add Target column
df2.insert(0,'Target',df.iloc[:,0])

Out[2324]: 
  Target Experiment1                                           Experiment2  
         Analyze1_ab                Analyze2_zz                Analyze1_yy   
          Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2 
0  XXX_1       Edge2  3.1E-07  -0.5       Edge2  2.1E-06  -0.9       Edge2  6.4E-02  -0.3 
1  XXX_4       Edge1  6.4E-12   1.1       Edge1  2.4E-11   9.4       Edge1  1.4E-11   1.4 
2  ABC_1       Edge1  3.9E-07   0.7       Edge1  2.9E-07   5.6       Edge1  6.8E-02   0.4 
3  ABC_2       Edge2  1.1E-09  -0.5       Edge2  1.2E-09   1.2       Edge2  1.0E-03  -0.5 
4  ABC_3       Edge2  4.6E-25  -0.8       Edge2  2.6E-10   1.9       Edge2  5.0E-17  -0.9 
5  XXX_2       Edge2  1.7E-07  -0.5       Edge2  5.7E-08  -0.3       Edge2  4.1E-02  -0.3 
6  ABC_4       Edge1  8.1E-02   0.5       Edge1  9.1E-02   1.5       Edge1  5.4E-02   0.6 
7  ABC_5       Edge1  6.7E-02   0.3       Edge1  4.2E-02   1.9       Edge1  5.6E-03   0.4 
8  XXX_3       Edge2  3.1E-03  -0.4       Edge1  2.4E-11   1.1       Edge2  2.4E-02  -0.3
like image 159
Allen Avatar answered Mar 20 '23 23:03

Allen


import pandas as pd

# Initialize DataFrame
# -----------------------------------------------------------------------------

df = pd.DataFrame({
    'Analyze1_ab': ['Edge2||3.1E-07||-0.5', 'Edge1||6.4E-12||1.1'],
    'Analyze2_zz': ['Edge2||2.1E-06||-0.9', 'Edge1||2.4E-11||9.4'],
    'Analyze1_yy': ['Edge2||6.4E-02||-0.3', 'Edge1||1.4E-11||1.4'],
    'Target': ['XXX_1', 'XXX_4'],})

df.columns = pd.MultiIndex.from_tuples(
    [('Experiment1', 'Analyze1_ab'),
     ('Experiment2', 'Analyze1_yy'),
     ('Experiment1', 'Analyze2_zz'),
     ('Target', '')])

# Split 'Analyses' columns by double pipes ||
# -----------------------------------------------------------------------------

# Initialize final DataFrame
final_df = pd.DataFrame()

for col_name in df.columns:
    if (col_name[1].startswith('Analyze') and
            df[col_name].str.contains('||').all()):
        # Split 'Analysis' by || into new columns
        splitted_analysis = df[col_name].str.split('\|\|', expand=True)
        # The new column names are 0, 1, 2. Let's rename them.
        splitted_analysis.columns = ['Connection', 'Val1', 'Val2']
        # Recreate MultiIndex
        splitted_analysis.columns = pd.MultiIndex.from_tuples(
            [(col_name[0], col_name[1], c) for c in splitted_analysis.columns])
        # Concatenate the new columns to the final_df
        final_df = pd.concat(objs=[final_df, splitted_analysis], axis=1)

# Add 'Target' column in the final_df.
# First, extract it.
target_col = pd.DataFrame(df[('Target', '')])
# Then, increase MultiIndex level of 'Target' from 2 to 3,
# to allow smooth concatenation with the final_df.
target_col.columns = pd.MultiIndex.from_tuples([('Target', '', '')])
final_df = pd.concat([final_df, target_col], axis=1)

Validation: print(final_df):

  Experiment1                Experiment2                Experiment1                Target
  Analyze1_ab                Analyze1_yy                Analyze2_zz                      
   Connection     Val1  Val2  Connection     Val1  Val2  Connection     Val1  Val2       
0       Edge2  3.1E-07  -0.5       Edge2  6.4E-02  -0.3       Edge2  2.1E-06  -0.9  XXX_1
1       Edge1  6.4E-12   1.1       Edge1  1.4E-11   1.4       Edge1  2.4E-11   9.4  XXX_4

Validation: pprint.pprint([c for c in final_df.columns]):

[('Experiment1', 'Analyze1_ab', 'Connection'),
 ('Experiment1', 'Analyze1_ab', 'Val1'),
 ('Experiment1', 'Analyze1_ab', 'Val2'),
 ('Experiment2', 'Analyze1_yy', 'Connection'),
 ('Experiment2', 'Analyze1_yy', 'Val1'),
 ('Experiment2', 'Analyze1_yy', 'Val2'),
 ('Experiment1', 'Analyze2_zz', 'Connection'),
 ('Experiment1', 'Analyze2_zz', 'Val1'),
 ('Experiment1', 'Analyze2_zz', 'Val2'),
 ('Target', '', '')]
like image 42
tuomastik Avatar answered Mar 20 '23 23:03

tuomastik