Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert excel or csv file to pandas multilevel dataframe

I've been given a reasonably large Excel file (5k rows), also as a CSV, that I would like to make into a pandas multilevel DataFame. The file is structured like this:

SampleID    OtherInfo    Measurements    Error    Notes
sample1     stuff                                 more stuff
                         36              6
                         26              7
                         37              8
sample2     newstuff                              lots of stuff
                         25              6
                         27              7

where the number of measurements is variable (and sometimes zero). There is no full blank row in between any of the information, and the 'Measurements' and 'Error' columns are empty on rows that have the other (string) data; this might make it harder to parse(?). Is there an easy way to automate this conversion? My initial idea is to parse the file with Python first and then feed stuff into DataFrame slots in a loop, but I don't know exactly how to implement it, or if it is even the best course of action.

Thanks in advance!

like image 511
cossatot Avatar asked Oct 06 '22 13:10

cossatot


1 Answers

Looks like your file has fixed width columns, for which read_fwf() can be used.

In [145]: data = """\
SampleID    OtherInfo    Measurements    Error    Notes                   
sample1     stuff                                 more stuff              
                         36              6
                         26              7
                         37              8
sample2     newstuff                              lots of stuff           
                         25              6
                         27              7
"""

In [146]: df = pandas.read_fwf(StringIO(data), widths=[12, 13, 14, 9, 15])

Ok, now we have the data, just a little bit of extra work and you have a frame on which you can use set_index() to create a MultiLevel index.

In [147]: df[['Measurements', 'Error']] = df[['Measurements', 'Error']].shift(-1)

In [148]: df[['SampleID', 'OtherInfo', 'Notes']] = df[['SampleID', 'OtherInfo', 'Notes']].fillna()

In [150]: df = df.dropna()

In [151]: df
Out[151]:
  SampleID OtherInfo  Measurements  Error          Notes
0  sample1     stuff            36      6     more stuff
1  sample1     stuff            26      7     more stuff
2  sample1     stuff            37      8     more stuff
4  sample2  newstuff            25      6  lots of stuff
5  sample2  newstuff            27      7  lots of stuff
like image 52
Wouter Overmeire Avatar answered Oct 10 '22 02:10

Wouter Overmeire