Pandas read multiindexed csv with blanks

Question

I'm struggling to properly load a CSV that has a multi-line header with blanks. The CSV looks like this:

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8

[Image: CSV Header]

What I would like to get is:

[Image: Desired Pandas Header]

When I try to load with pd.read_csv(file, header=[0,1], sep=','), I end up with the following:

[Image: Incorrect result]
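For reference, the problem can be reproduced without a file on disk. The sketch below is only an illustration: CSV_TEXT and the use of io.StringIO are stand-ins for the real file, and the exact placeholder text pandas generates may differ between versions.

```python
import io
import pandas as pd

# Stand-in for the real file (assumption: same layout as the CSV above)
CSV_TEXT = ",,C,,,D,,\nA,B,X,Y,Z,X,Y,Z\n1,2,3,4,5,6,7,8\n"

df = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1], sep=',')

# Level 0 of the header: blank cells come back as 'Unnamed: ...' placeholders
# instead of repeating the label to their left.
top_level = [upper for upper, lower in df.columns]
print(top_level)
```

Only the cells that literally contain C and D keep their labels; every blank cell in the first header row gets its own placeholder name.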

Is there a way to get the desired result?

Note: alternatively, I would accept this as a result:

[Image: Alternative result]

Versions used:

Python: 2.7.8
Pandas 0.16.0

unutbu · Accepted Answer

Here is an automated way to fix the column index. First, pull the column level values into a DataFrame:

columns = pd.DataFrame(df.columns.tolist())

then replace the level-0 labels that start with Unnamed: with NaN:

columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan

and then forward-fill the NaNs:

columns[0] = columns[0].ffill()

so that columns now looks like

In [314]: columns
Out[314]: 
     0  1
0  NaN  A
1  NaN  B
2    C  X
3    C  Y
4    C  Z
5    D  X
6    D  Y
7    D  Z

Now we can find the remaining NaNs and fill them with empty strings:

mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')

To make the first two columns, A and B, indexable as df['A'] and df['B'] -- as though they were single-leveled -- you could swap the values in the first and second columns:

columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
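The .values on the right-hand side matters: when a DataFrame is assigned through .loc, pandas aligns it on row and column labels, so the two columns would be matched back to themselves and nothing would be swapped. Passing a plain array makes the assignment positional. A minimal sketch on a three-row stand-in for columns (the sample values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the `columns` frame from the steps above
columns = pd.DataFrame({0: [np.nan, np.nan, 'C'], 1: ['A', 'B', 'X']})

mask = pd.isnull(columns[0])          # rows whose top-level label is still missing
columns[0] = columns[0].fillna('')

# .values strips the labels, so the reordered data is written by position
columns.loc[mask, [0, 1]] = columns.loc[mask, [1, 0]].values

print(columns)
```

After the swap, A and B sit in the first level with an empty string in the second, while the C row is untouched.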

Now you can build a new MultiIndex and assign it to df.columns:

df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())

Putting it all together, if data is

,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0

then

import numpy as np
import pandas as pd

df = pd.read_csv('data', header=[0,1], sep=',')
# Collect the (level 0, level 1) column labels in a DataFrame
columns = pd.DataFrame(df.columns.tolist())
# Blank top-level cells are read as 'Unnamed: ...'; mark them as NaN and forward-fill
columns.loc[columns[0].str.startswith('Unnamed:'), 0] = np.nan
columns[0] = columns[0].ffill()
# Leading blanks (above A and B) are still NaN: replace them with '' and swap the levels
mask = pd.isnull(columns[0])
columns[0] = columns[0].fillna('')
columns.loc[mask, [0,1]] = columns.loc[mask, [1,0]].values
df.columns = pd.MultiIndex.from_tuples(columns.to_records(index=False).tolist())
print(df)

yields

   A  B  C        D      
         X  Y  Z  X  Y  Z
0  1  2  3  4  5  6  7  8
1  3  4  5  6  7  8  9  0
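If you need this in more than one place, the same idea can be wrapped in a helper. This is only a sketch under a few assumptions: the name fill_header_blanks is made up here, the header has exactly two rows, and blank cells show up as labels starting with 'Unnamed:'. It uses a plain loop instead of the intermediate DataFrame, and io.StringIO stands in for the file.

```python
import io
import pandas as pd

# Stand-in for the 'data' file used above
CSV_TEXT = """,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
"""


def fill_header_blanks(df):
    """Hypothetical helper: forward-fill blank top-level labels of a
    two-level column index and return a relabelled copy of df."""
    new_columns = []
    last_label = None  # most recent real top-level label seen so far
    for upper, lower in df.columns:
        if not str(upper).startswith('Unnamed:'):
            last_label = upper
        if last_label is None:
            # Leading blanks: promote the second-level label to the top
            new_columns.append((lower, ''))
        else:
            new_columns.append((last_label, lower))
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(new_columns)
    return out


raw = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1])
fixed = fill_header_blanks(raw)
print(fixed)
```

With the columns fixed, fixed['C'] selects the X, Y and Z columns under C, and fixed[('D', 'Z')] selects a single column.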

firelynx · Answer

There is no magical way to make pandas aware of how you want your index to look. The closest you can get is to specify most of it yourself, like this:

names = ['A', 'B', 
         ('C','X'), ('C', 'Y'), ('C', 'Z'),
         ('D','X'), ('D','Y'), ('D', 'Z')]
pd.read_csv(file, header=1, names=names, index_col=[0, 1])

Gives:

     C        D      
     X  Y  Z  X  Y  Z
A B                  
1 2  3  4  5  6  7  8

To do this dynamically, you could read just the first two lines of the CSV and loop through the resulting columns to build the names list before loading the full dataset.

pd.read_csv(file, nrows=1, header=[0,1], index_col=[0, 1])

Then access the columns and loop over them to create your header. Again, not a very clean solution, but it should work.
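A rough sketch of that two-pass idea follows. It is a variation, not exactly the names= call above: it reads the two header rows as plain data, builds the tuples in a loop, and assigns them as a MultiIndex after loading the body. CSV_TEXT and io.StringIO are stand-ins for the real file, and a two-row header is assumed.

```python
import io
import pandas as pd

# Stand-in for the real file (assumption: two header rows, then data)
CSV_TEXT = """,,C,,,D,,
A,B,X,Y,Z,X,Y,Z
1,2,3,4,5,6,7,8
3,4,5,6,7,8,9,0
"""

# Pass 1: read only the two header rows, as ordinary string cells
header_rows = pd.read_csv(io.StringIO(CSV_TEXT), header=None, nrows=2, dtype=str)
top = header_rows.iloc[0].ffill()   # blank cells are NaN; repeat the label to the left
bottom = header_rows.iloc[1]

# Build one tuple per column; leading blanks keep only their second-row label
names = []
for upper, lower in zip(top, bottom):
    if pd.isna(upper):
        names.append((lower, ''))
    else:
        names.append((upper, lower))

# Pass 2: load the body without a header and attach the generated columns
df = pd.read_csv(io.StringIO(CSV_TEXT), header=None, skiprows=2)
df.columns = pd.MultiIndex.from_tuples(names)
print(df)
```

Reading the header separately avoids having to clean up the 'Unnamed:' placeholders afterwards, at the cost of opening the file twice.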

Tags:

python

pandas

csv

python-2.7

Julien Marrec
