I want to create a pandas DataFrame with a MultiIndex on both the rows and the columns, and read it from an ASCII text file. My data looks like:
import numpy as np
from pandas import DataFrame, MultiIndex

col_indx = MultiIndex.from_tuples([('A', 'B', 'C'), ('A', 'B', 'C2'), ('A', 'B', 'C3'),
('A', 'B2', 'C'), ('A', 'B2', 'C2'), ('A', 'B2', 'C3'),
('A', 'B3', 'C'), ('A', 'B3', 'C2'), ('A', 'B3', 'C3'),
('A2', 'B', 'C'), ('A2', 'B', 'C2'), ('A2', 'B', 'C3'),
('A2', 'B2', 'C'), ('A2', 'B2', 'C2'), ('A2', 'B2', 'C3'),
('A2', 'B3', 'C'), ('A2', 'B3', 'C2'), ('A2', 'B3', 'C3')],
names=['one','two','three'])
row_indx = MultiIndex.from_tuples([(0, 'North', 'M'),
(1, 'East', 'F'),
(2, 'West', 'M'),
(3, 'South', 'M'),
(4, 'South', 'F'),
(5, 'West', 'F'),
(6, 'North', 'M'),
(7, 'North', 'M'),
(8, 'East', 'F'),
(9, 'South', 'M')],
names=['n', 'location', 'sex'])
size = (len(row_indx), len(col_indx))
data = np.random.randint(0, 10, size)
df = DataFrame(data, index=row_indx, columns=col_indx)
print(df)
I've tried df.to_csv() and read_csv() but they don't keep the index.
I was thinking of creating a new format that uses extra delimiters. For example, a row of ---------------- to mark the end of the column indexes and a | to mark the end of a row index. So it would look like this:
one | A A A A A A A A A A2 A2 A2 A2 A2 A2 A2 A2 A2
two | B B B B2 B2 B2 B3 B3 B3 B B B B2 B2 B2 B3 B3 B3
three | C C2 C3 C C2 C3 C C2 C3 C C2 C3 C C2 C3 C C2 C3
--------------------------------------------------------------------------------------
n location sex :
0 North M | 2 3 9 1 0 6 5 9 5 9 4 4 0 9 6 2 6 1
1 East F | 6 2 9 2 7 0 0 3 7 4 8 1 3 2 1 7 7 5
2 West M | 5 8 9 7 6 0 3 0 2 5 0 3 9 6 7 3 4 9
3 South M | 6 2 3 6 4 0 4 0 1 9 3 6 2 1 0 6 9 3
4 South F | 9 6 0 0 6 1 7 0 8 1 7 6 2 0 8 1 5 3
5 West F | 7 9 7 8 2 0 4 3 8 9 0 3 4 9 2 5 1 7
6 North M | 3 3 5 7 9 4 2 6 3 2 7 5 5 5 6 4 2 9
7 North M | 7 4 8 6 8 4 5 7 9 0 2 9 1 9 7 9 5 6
8 East F | 1 6 5 3 6 4 6 9 6 9 2 4 2 9 8 4 2 4
9 South M | 9 6 6 1 3 1 3 5 7 4 8 6 7 7 8 9 2 3
Does Pandas have a way to write/read DataFrames to/from ASCII files with MultiIndexes?
The from_tuples() function converts a list of tuples to a MultiIndex. It is one of several ways to construct a MultiIndex.
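As a sketch of one of the other constructors: when every combination of level values is wanted, as in the question's columns, MultiIndex.from_product should build the same index without spelling out all 18 tuples. The variable names below are illustrative only.

```python
from itertools import product

import pandas as pd

levels = [['A', 'A2'], ['B', 'B2', 'B3'], ['C', 'C2', 'C3']]
names = ['one', 'two', 'three']

# Explicit list of tuples, in the same order as the question's col_indx.
from_tuples = pd.MultiIndex.from_tuples(list(product(*levels)), names=names)

# Cartesian product of the level values; no tuples written out by hand.
from_product = pd.MultiIndex.from_product(levels, names=names)

print(from_product.equals(from_tuples))
```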
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.
The set_index() function is used to set the DataFrame index using existing columns. Set the DataFrame index (row labels) using one or more existing columns or arrays of the correct length. The index can replace the existing index or expand on it.
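A minimal sketch of both behaviours, using a made-up flat table whose column names mirror the question's row levels:

```python
import pandas as pd

# Hypothetical flat table; the values are invented for illustration.
flat = pd.DataFrame({
    'n': [0, 1, 2],
    'location': ['North', 'East', 'West'],
    'sex': ['M', 'F', 'M'],
    'value': [2, 6, 5],
})

# Passing a list of column names produces a MultiIndex on the rows.
indexed = flat.set_index(['n', 'location', 'sex'])

# append=True expands the existing index instead of replacing it.
expanded = flat.set_index('location', append=True)

print(indexed.index.names)
print(expanded.index.names)
```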
We can read data from a text file using read_table() in pandas. This function reads a general delimited file to a DataFrame object.
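For instance, assuming tab-separated input (an in-memory string stands in for a file on disk here, and the data is made up):

```python
import io

import pandas as pd

# Small tab-separated text standing in for a real file.
text = "n\tlocation\tsex\tvalue\n0\tNorth\tM\t2\n1\tEast\tF\t6\n"

# read_table defaults to a tab delimiter; index_col selects the columns
# that become the (multi-level) row index.
table = pd.read_table(io.StringIO(text), index_col=[0, 1, 2])

print(table)
```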
Not sure which version of pandas you are using, but with 0.7.3 you can export your DataFrame to a TSV file and retain the indices by doing this:
df.to_csv('mydf.tsv', sep='\t')
The reason you need to export to TSV rather than CSV is that the column headers contain , characters. This should solve the first part of your question.
The second part is trickier since, as far as I can tell, you need to know beforehand what you want your DataFrame to contain. In particular, you need to know:
which columns of the TSV file make up the row MultiIndex
whether the header row of the TSV file represents a column MultiIndex
To illustrate this, let's read the TSV file we saved above back into a new DataFrame:
In [1]: t_df = read_table('mydf.tsv', index_col=[0,1,2])
In [2]: all(t_df.index == df.index)
Out[2]: True
So we managed to read mydf.tsv into a DataFrame that has the same row index as the original df. But:
In [3]: all(t_df.columns == df.columns)
Out[3]: False
The reason is that pandas (as far as I can tell) has no way of parsing the header row correctly into a MultiIndex. As I mentioned above, if you know beforehand that your TSV file header represents a MultiIndex, then you can do the following to fix this:
In [4]: from ast import literal_eval
In [5]: t_df.columns = MultiIndex.from_tuples(t_df.columns.map(literal_eval).tolist(),
names=['one','two','three'])
In [6]: all(t_df.columns == df.columns)
Out[6]: True
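In pandas releases more recent than the 0.7.3 used above, read_csv and read_table accept a list of row numbers for header, which parses several header rows into a column MultiIndex. A minimal sketch of the round trip under that assumption, using a smaller made-up frame than the question's and an in-memory buffer in place of mydf.tsv:

```python
import io

import numpy as np
import pandas as pd

col_indx = pd.MultiIndex.from_product(
    [['A', 'A2'], ['B', 'B2'], ['C', 'C2']], names=['one', 'two', 'three'])
row_indx = pd.MultiIndex.from_tuples(
    [(0, 'North', 'M'), (1, 'East', 'F'), (2, 'West', 'M')],
    names=['n', 'location', 'sex'])
shape = (len(row_indx), len(col_indx))
df = pd.DataFrame(np.arange(shape[0] * shape[1]).reshape(shape),
                  index=row_indx, columns=col_indx)

# Write to a buffer instead of a file; a path would work the same way.
buf = io.StringIO()
df.to_csv(buf, sep='\t')
buf.seek(0)

# Three header rows form the column MultiIndex,
# three leading columns form the row MultiIndex.
t_df = pd.read_csv(buf, sep='\t', header=[0, 1, 2], index_col=[0, 1, 2])

print(t_df.index.equals(df.index))
print(t_df.columns.equals(df.columns))
```

With this approach the number of header rows and index columns still has to be known in advance, as described above.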
You can change the print options using set_option:
display.multi_sparse : boolean
Default True, "sparsify" MultiIndex display (don't display repeated elements in outer levels within groups)
Now the DataFrame will be printed as desired:
In [11]: pd.set_option('multi_sparse', False)
In [12]: df
Out[12]:
one A A A A A A A A A A2 A2 A2 A2 A2 A2 A2 A2 A2
two B B B B2 B2 B2 B3 B3 B3 B B B B2 B2 B2 B3 B3 B3
three C C2 C3 C C2 C3 C C2 C3 C C2 C3 C C2 C3 C C2 C3
n location sex
0 North M 2 1 6 4 6 4 7 1 1 0 4 3 9 2 0 0 6 4
1 East F 3 5 5 6 4 8 0 3 2 3 9 8 1 6 7 4 7 2
2 West M 7 9 3 5 0 1 2 8 1 6 0 7 9 9 3 2 2 4
3 South M 1 0 0 3 5 7 7 0 9 3 0 3 3 6 8 3 6 1
4 South F 8 0 0 7 3 8 0 8 0 5 5 6 0 0 0 1 8 7
5 West F 6 5 9 4 7 2 5 6 1 2 9 4 7 5 5 4 3 6
6 North M 3 3 0 1 1 3 6 3 8 6 4 1 0 5 5 5 4 9
7 North M 0 4 9 8 5 7 7 0 5 8 4 1 5 7 6 3 6 8
8 East F 5 6 2 7 0 6 2 7 1 2 0 5 6 1 4 8 0 3
9 South M 1 2 0 6 9 7 5 3 3 8 7 6 0 5 4 3 5 9
Note: in older pandas versions this was pd.set_printoptions(multi_sparse=False).
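If the setting should apply only temporarily, one option is pd.option_context, which restores the previous value on exit. A small sketch with a made-up frame, assuming the option is at its default (True) beforehand:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([('X', 'p'), ('X', 'q'), ('Y', 'p')])
df = pd.DataFrame([[1, 2, 3]], columns=cols)

# With the default setting, repeated outer labels are left blank.
sparse = df.to_string()

# Inside the context manager every label is written out in full.
with pd.option_context('display.multi_sparse', False):
    dense = df.to_string()

print(sparse)
print(dense)
```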