
How to write/read a Pandas DataFrame with MultiIndex from/to an ASCII file?

Tags: python, pandas

I want to create a Pandas DataFrame with a MultiIndex on both the rows and the columns, write it to an ASCII text file, and read it back. My data looks like:

import numpy as np
from pandas import DataFrame, MultiIndex

col_indx = MultiIndex.from_tuples([('A',  'B',  'C'), ('A',  'B',  'C2'), ('A',  'B',  'C3'), 
                                   ('A',  'B2', 'C'), ('A',  'B2', 'C2'), ('A',  'B2', 'C3'), 
                                   ('A',  'B3', 'C'), ('A',  'B3', 'C2'), ('A',  'B3', 'C3'), 
                                   ('A2', 'B',  'C'), ('A2', 'B',  'C2'), ('A2', 'B',  'C3'), 
                                   ('A2', 'B2', 'C'), ('A2', 'B2', 'C2'), ('A2', 'B2', 'C3'), 
                                   ('A2', 'B3', 'C'), ('A2', 'B3', 'C2'), ('A2', 'B3', 'C3')], 
                                   names=['one','two','three']) 
row_indx = MultiIndex.from_tuples([(0,  'North', 'M'), 
                                   (1,  'East',  'F'), 
                                   (2,  'West',  'M'), 
                                   (3,  'South', 'M'), 
                                   (4,  'South', 'F'), 
                                   (5,  'West',  'F'), 
                                   (6,  'North', 'M'), 
                                   (7,  'North', 'M'), 
                                   (8,  'East',  'F'), 
                                   (9,  'South', 'M')], 
                                   names=['n', 'location', 'sex'])
size=len(row_indx), len(col_indx)
data = np.random.randint(0,10, size)
df = DataFrame(data, index=row_indx, columns=col_indx)
print(df)

I've tried df.to_csv() and read_csv() but they don't keep the index.

I was thinking of creating a new format that uses extra delimiters: for example, a row of ---------------- to mark the end of the column index levels and a | to mark the end of the row index. It would look like this:

one            | A   A   A   A   A   A   A   A   A  A2  A2  A2  A2  A2  A2  A2  A2  A2
two            | B   B   B  B2  B2  B2  B3  B3  B3   B   B   B  B2  B2  B2  B3  B3  B3
three          | C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3
--------------------------------------------------------------------------------------
n location sex :                                                                      
0 North    M   | 2   3   9   1   0   6   5   9   5   9   4   4   0   9   6   2   6   1
1 East     F   | 6   2   9   2   7   0   0   3   7   4   8   1   3   2   1   7   7   5
2 West     M   | 5   8   9   7   6   0   3   0   2   5   0   3   9   6   7   3   4   9
3 South    M   | 6   2   3   6   4   0   4   0   1   9   3   6   2   1   0   6   9   3
4 South    F   | 9   6   0   0   6   1   7   0   8   1   7   6   2   0   8   1   5   3
5 West     F   | 7   9   7   8   2   0   4   3   8   9   0   3   4   9   2   5   1   7
6 North    M   | 3   3   5   7   9   4   2   6   3   2   7   5   5   5   6   4   2   9
7 North    M   | 7   4   8   6   8   4   5   7   9   0   2   9   1   9   7   9   5   6
8 East     F   | 1   6   5   3   6   4   6   9   6   9   2   4   2   9   8   4   2   4
9 South    M   | 9   6   6   1   3   1   3   5   7   4   8   6   7   7   8   9   2   3

Does Pandas have a way to write/read DataFrames to/from ASCII files with MultiIndexes?

dailyglen asked Jun 14 '12 21:06



2 Answers

I'm not sure which version of pandas you are using, but with 0.7.3 you can export your DataFrame to a TSV file and retain the indices by doing this:

df.to_csv('mydf.tsv', sep='\t')

You need to export to TSV rather than CSV because the column headers contain , characters. This should solve the first part of your question.

The second part is trickier because, as far as I can tell, you need to know beforehand what you want your DataFrame to contain. In particular, you need to know:

  1. which columns in your TSV represent the row MultiIndex, and
  2. that the remaining column headers should also be converted to a MultiIndex.

To illustrate this, let's read the TSV file we saved above back into a new DataFrame:

In [1]: t_df = read_table('mydf.tsv', index_col=[0,1,2])
In [2]: all(t_df.index == df.index)
Out[2]: True

So we managed to read mydf.tsv into a DataFrame that has the same row index as the original df. But:

In [3]: all(t_df.columns == df.columns)
Out[3]: False

The reason is that pandas (as far as I can tell) has no way of parsing the header row correctly into a MultiIndex. As I mentioned above, if you know beforehand that your TSV file header represents a MultiIndex, then you can do the following to fix this:

In [4]: from ast import literal_eval
In [5]: t_df.columns = MultiIndex.from_tuples(t_df.columns.map(literal_eval).tolist(), 
                                              names=['one','two','three'])
In [6]: all(t_df.columns == df.columns)
Out[6]: True
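
For readers on a more recent pandas than the 0.7.3 discussed above (an assumption about your setup), read_csv accepts a list of row numbers for header, so the literal_eval step should not be needed. The following is a minimal sketch rather than a definitive recipe: it uses a smaller stand-in frame with the same level names as the question and an in-memory buffer instead of mydf.tsv.

```python
import io

import numpy as np
import pandas as pd

# Smaller stand-in for the frame in the question (same level names).
col_indx = pd.MultiIndex.from_product(
    [["A", "A2"], ["B", "B2", "B3"], ["C", "C2", "C3"]],
    names=["one", "two", "three"],
)
row_indx = pd.MultiIndex.from_tuples(
    [(0, "North", "M"), (1, "East", "F"), (2, "West", "M")],
    names=["n", "location", "sex"],
)
data = np.random.randint(0, 10, (len(row_indx), len(col_indx)))
df = pd.DataFrame(data, index=row_indx, columns=col_indx)

# Write to an in-memory TSV; a file path such as 'mydf.tsv' works the same way.
buf = io.StringIO()
df.to_csv(buf, sep="\t")
buf.seek(0)

# header=[0, 1, 2]: the first three lines hold the column levels.
# index_col=[0, 1, 2]: the first three fields of each row hold the row levels.
t_df = pd.read_csv(buf, sep="\t", header=[0, 1, 2], index_col=[0, 1, 2])

print(t_df.index.equals(df.index))
print(t_df.columns.equals(df.columns))
```

You still have to tell read_csv how many header lines and index columns the file contains, so the point made above about knowing the layout beforehand continues to apply.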
diliop answered Sep 17 '22 23:09


You can change the print options using set_option:

display.multi_sparse:
: boolean
   Default True, "sparsify" MultiIndex display
   (don't display repeated elements in outer levels within groups)

Now the DataFrame will be printed as desired:

In [11]: pd.set_option('multi_sparse', False)

In [12]: df
Out[12]: 
one             A   A   A   A   A   A   A   A   A  A2  A2  A2  A2  A2  A2  A2  A2  A2
two             B   B   B  B2  B2  B2  B3  B3  B3   B   B   B  B2  B2  B2  B3  B3  B3
three           C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3   C  C2  C3
n location sex                                                                       
0 North    M    2   1   6   4   6   4   7   1   1   0   4   3   9   2   0   0   6   4
1 East     F    3   5   5   6   4   8   0   3   2   3   9   8   1   6   7   4   7   2
2 West     M    7   9   3   5   0   1   2   8   1   6   0   7   9   9   3   2   2   4
3 South    M    1   0   0   3   5   7   7   0   9   3   0   3   3   6   8   3   6   1
4 South    F    8   0   0   7   3   8   0   8   0   5   5   6   0   0   0   1   8   7
5 West     F    6   5   9   4   7   2   5   6   1   2   9   4   7   5   5   4   3   6
6 North    M    3   3   0   1   1   3   6   3   8   6   4   1   0   5   5   5   4   9
7 North    M    0   4   9   8   5   7   7   0   5   8   4   1   5   7   6   3   6   8
8 East     F    5   6   2   7   0   6   2   7   1   2   0   5   6   1   4   8   0   3
9 South    M    1   2   0   6   9   7   5   3   3   8   7   6   0   5   4   3   5   9

Note: in older pandas versions this was pd.set_printoptions(multi_sparse=False).
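
If the goal is to get this unsparsified layout into an ASCII file without changing a global option, one possibility is DataFrame.to_string, which takes its own sparsify argument. This is a small sketch on a hypothetical two-level frame, not the frame from the question, and it covers only the writing side; it does not parse the text back into a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with MultiIndex rows and columns.
col_indx = pd.MultiIndex.from_product(
    [["A", "A2"], ["B", "B2"]], names=["one", "two"]
)
row_indx = pd.MultiIndex.from_tuples(
    [(0, "North"), (1, "East"), (2, "North")], names=["n", "location"]
)
df = pd.DataFrame(
    np.random.randint(0, 10, (3, 4)), index=row_indx, columns=col_indx
)

# sparsify=False repeats every outer-level label instead of blanking repeats.
text = df.to_string(sparsify=False)
print(text)

# The string can then be written with an ordinary open(path, "w") call.
```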

Andy Hayden answered Sep 21 '22 23:09