Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas read multiindexed csv with blanks

I'm struggling with properly loading a csv that has a multi lines header with blanks. The CSV looks like this:

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8

CSV Header

What I would like to get is:

Desired Pandas Header

When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:

Incorrect result

Is there a way to get the desired result?


Note: alternatively, I would accept this as a result:

Alternative result


Versions used:

  • Python: 2.7.8
  • Pandas 0.16.0
like image 646
Julien Marrec Avatar asked May 19 '15 10:05

Julien Marrec


2 Answers

Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:

columns = pd.DataFrame(df.columns.tolist())

then rename the Unnamed: columns to NaN:

columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan

and then forward-fill the NaNs:

columns[0] = columns[0].fillna(method='ffill')

so that columns now looks like

In [314]: columns
Out[314]: 
     0  1
0  NaN  A
1  NaN  B
2    C  X
3    C  Y
4    C  Z
5    D  X
6    D  Y
7    D  Z

Now we can find the remaining NaNs and fill them with empty strings:

mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')

To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:

columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values

Now you can build a new MultiIndex and assign it to df.columns:

df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())

Putting it all together, if data is

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

then

import numpy as np
import pandas as pd
df = pd.read_csv('data', header=[0,1], sep=',')
columns = pd.DataFrame(df.columns.tolist())
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].fillna(method='ffill')
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
    print(df)

yields

   A  B  C        D      
         X  Y  Z  X  Y  Z
0  1  2  3  4  5  6  7  8
1  3  4  5  6  7  8  9  0
like image 190
unutbu Avatar answered Sep 22 '22 10:09

unutbu


There is no magical way of making pandas aware of how you want your index to look, the closest way you can do this is by specifying a lot yourself, like this:

names = ['A', 'B', 
         ('C','X'), ('C', 'Y'), ('C', 'Z'),
         ('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, mangle_dupe_cols=True,
            header=1, names=names, index_col=[0, 1])

Gives:

     C        D      
     X  Y  Z  X  Y  Z
A B                  
1 2  3  4  5  6  7  8

To do this in a dynamic fashion, you could read the first two lines of the CSV as they are and loop through the columns you get to generate the names variable dynamically before loading the full dataset.

pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])

Then access the columns and loop to create your header. Again, not a very clean solution, but should work.

like image 27
firelynx Avatar answered Sep 19 '22 10:09

firelynx