Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tidy data from multilevel Excel file via pandas

I want to produce tidy data from an Excel file which looks like this, with three levels of "merged" headers:

enter image description here

Pandas reads the file just fine, with multilevel headers:

# df = pandas.read_excel('test.xlsx', header=[0,1,2])

For repeatability, you can copy-paste this:

df = pandas.DataFrame({('Unnamed: 0_level_0', 'Unnamed: 0_level_1', 'a'): {1: 'aX', 2: 'aY'}, ('Unnamed: 1_level_0', 'Unnamed: 1_level_1', 'b'): {1: 'bX', 2: 'bY'}, ('Unnamed: 2_level_0', 'Unnamed: 2_level_1', 'c'): {1: 'cX', 2: 'cY'}, ('level1_1', 'level2_1', 'level3_1'): {1: 1, 2: 10}, ('level1_1', 'level2_1', 'level3_2'): {1: 2, 2: 20}, ('level1_1', 'level2_2', 'level3_1'): {1: 3, 2: 30}, ('level1_1', 'level2_2', 'level3_2'): {1: 4, 2: 40}, ('level1_2', 'level2_1', 'level3_1'): {1: 5, 2: 50}, ('level1_2', 'level2_1', 'level3_2'): {1: 6, 2: 60}, ('level1_2', 'level2_2', 'level3_1'): {1: 7, 2: 70}, ('level1_2', 'level2_2', 'level3_2'): {1: 8, 2: 80}})

I want to normalise this so that the level headings are in variable rows, but retain columns a, b and c as columns:

Desired output

Without the multi-level headers, I would do pandas.melt(df, id_vars=['a', 'b', 'c']) to get what I want. pandas.melt(df) gives me the three variable columns I want, but obviously doesn't retain the a, b, and c columns.

like image 926
chthonicdaemon Avatar asked Oct 29 '16 13:10

chthonicdaemon


2 Answers

It should be as simple as:

wide_df = pandas.read_excel(xlfile, sheetname, header=[0, 1, 2], index_col=[0, 1, 2, 3])

long_df = wide_df.stack().stack().stack()

Here's an example with a mock-up CSV file (note the 4th row to label the index and the first column to label the header levels):

from io import StringIO
from textwrap import dedent

import pandas

mockcsv = StringIO(dedent("""\
    num,,,this1,this1,this1,this1,that1,that1,that1,that1
    let,,,thisA,thisA,thatA,thatA,thisB,thisB,thatB,thatB
    animal,,,cat,dog,bird,lizard,cat,dog,bird,lizard
    a,b,c,,,,,,,,
    a1,b1,c1,x1,x2,x3,x4,x5,x6,x7,x8
    a1,b1,c2,y1,y2,y3,y4,y5,y6,y7,y8
    a1,b2,c1,z1,z2,z3,z4,z5,6z,zy,z8
"""))


wide_df = pandas.read_csv(mockcsv, index_col=[0, 1, 2], header=[0, 1, 2])
long_df = wide_df.stack().stack().stack()

So wide_df looks like this:

num      this1                  that1                 
let      thisA     thatA        thisB     thatB       
animal     cat dog  bird lizard   cat dog  bird lizard
a  b  c                                               
a1 b1 c1    x1  x2    x3     x4    x5  x6    x7     x8
      c2    y1  y2    y3     y4    y5  y6    y7     y8
   b2 c1    z1  z2    z3     z4    z5  6z    zy     z8

And long_df

a   b   c   animal  let    num  
a1  b1  c1  bird    thatA  this1    x3
                    thatB  that1    x7
            cat     thisA  this1    x1
                    thisB  that1    x5
            dog     thisA  this1    x2
                    thisB  that1    x6
            lizard  thatA  this1    x4
                    thatB  that1    x8
        c2  bird    thatA  this1    y3
                    thatB  that1    y7
            cat     thisA  this1    y1
                    thisB  that1    y5
            dog     thisA  this1    y2
                    thisB  that1    y6
            lizard  thatA  this1    y4
                    thatB  that1    y8
    b2  c1  bird    thatA  this1    z3
                    thatB  that1    zy
            cat     thisA  this1    z1
                    thisB  that1    z5
            dog     thisA  this1    z2
                    thisB  that1    6z
            lizard  thatA  this1    z4
                    thatB  that1    z8

With literal data shown in the OP, you can get at this w/o modifying anything by doing the following:

index_names = ['a', 'b', 'c']
col_names = ['Level1', 'Level2', 'Level3']
df = (
    pandas.read_excel('Book1.xlsx', header=[0, 1, 2], index_col=[0, 1, 2, 3])
        .reset_index(level=0, drop=True)
        .rename_axis(index_names, axis='index')
        .rename_axis(col_names, axis='columns')
        .stack()
        .stack()
        .stack()
        .to_frame()
)

I think the tricky part will be inspecting each of your files to figure out what index_names should be.

like image 84
Paul H Avatar answered Oct 26 '22 02:10

Paul H


Fragment the DFs into two portions for the ease of melting and joining them back.

first_half = df.iloc[:, :3]
second_half = df.iloc[:, 3:]

Melt the second fragment.

melt_second_half = pd.melt(second_half)

Repeat the values in the first fragment by calculating the value found by dividing the number of rows in the melted DF with it's own length.

repeats = int(melt_second_half.shape[0]/first_half.shape[0])
first_reps = pd.concat([first_half] * repeats, ignore_index=True)
col_names = first_reps.columns.get_level_values(2)
melt_first_half = pd.DataFrame(first_reps.values, columns=col_names)

Concatenate both back and sort the resulting DF according to the value column.

df_concat = pd.concat([melt_first_half, melt_second_half], axis=1)
df_concat.sort_values('value').reset_index(drop=True)

Image

like image 33
Nickil Maveli Avatar answered Oct 26 '22 03:10

Nickil Maveli