I've been given a reasonably large Excel file (5k rows), also as a CSV, that I would like to make into a pandas multilevel DataFame. The file is structured like this:
SampleID OtherInfo Measurements Error Notes
sample1 stuff more stuff
36 6
26 7
37 8
sample2 newstuff lots of stuff
25 6
27 7
where the number of measurements is variable (and sometimes zero). There is no full blank row in between any of the information, and the 'Measurements' and 'Error' columns are empty on rows that have the other (string) data; this might make it harder to parse(?). Is there an easy way to automate this conversion? My initial idea is to parse the file with Python first and then feed stuff into DataFrame slots in a loop, but I don't know exactly how to implement it, or if it is even the best course of action.
Thanks in advance!
Looks like your file has fixed width columns, for which read_fwf() can be used.
In [145]: data = """\
SampleID OtherInfo Measurements Error Notes
sample1 stuff more stuff
36 6
26 7
37 8
sample2 newstuff lots of stuff
25 6
27 7
"""
In [146]: df = pandas.read_fwf(StringIO(data), widths=[12, 13, 14, 9, 15])
Ok, now we have the data, just a little bit of extra work and you have a frame on which you can use set_index() to create a MultiLevel index.
In [147]: df[['Measurements', 'Error']] = df[['Measurements', 'Error']].shift(-1)
In [148]: df[['SampleID', 'OtherInfo', 'Notes']] = df[['SampleID', 'OtherInfo', 'Notes']].fillna()
In [150]: df = df.dropna()
In [151]: df
Out[151]:
SampleID OtherInfo Measurements Error Notes
0 sample1 stuff 36 6 more stuff
1 sample1 stuff 26 7 more stuff
2 sample1 stuff 37 8 more stuff
4 sample2 newstuff 25 6 lots of stuff
5 sample2 newstuff 27 7 lots of stuff
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With