MultiIndexing rows vs. columns in pandas DataFrame

Question

I am working with a MultiIndexed DataFrame in pandas and am wondering whether I should put the MultiIndex on the rows or on the columns.

My data looks something like this: [image: DataTable]

Code:

import numpy as np
import pandas as pd

# Column MultiIndex: every (condition, patient, measure) combination.
# MultiIndex.from_product replaces the old pd.tools.util.cartesian_product helper.
colidxs = pd.MultiIndex.from_product([['condition1', 'condition2'],
                                      ['patient1', 'patient2'],
                                      ['measure1', 'measure2', 'measure3']],
                                     names=['condition', 'patient', 'measure'])
# Row index: one row per time point
rowidxs = pd.Index([0, 1, 2, 3], name='time')
data = pd.DataFrame(np.random.randn(len(rowidxs), len(colidxs)),
                    index=rowidxs, columns=colidxs)

Here I chose to MultiIndex the columns, on the rationale that a pandas DataFrame is made up of Series and my data are ultimately a collection of time series (hence the rows are indexed by time).

I ask because there seems to be some asymmetry between rows and columns when it comes to MultiIndexing. For example, the pandas documentation shows how query works on a row-MultiIndexed DataFrame, but if the DataFrame is column-MultiIndexed, the documented command has to be replaced by something like df.T.query('color == "red"').T.
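To make the asymmetry concrete, here is a minimal sketch using a frame shaped like the one above (filled with np.arange instead of random numbers, which is my own choice so the result is reproducible). It compares the transpose-and-query workaround with xs and loc, which can select on a column level directly.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["condition1", "condition2"],
     ["patient1", "patient2"],
     ["measure1", "measure2", "measure3"]],
    names=["condition", "patient", "measure"],
)
df = pd.DataFrame(
    np.arange(4 * len(cols)).reshape(4, len(cols)),
    index=pd.Index([0, 1, 2, 3], name="time"),
    columns=cols,
)

# query sees index level names only on the row axis, so transpose first,
# filter, then transpose back.
via_query = df.T.query('condition == "condition1"').T

# xs can target a column level directly; drop_level=False keeps all three levels.
via_xs = df.xs("condition1", axis=1, level="condition", drop_level=False)

# loc on the outermost column level also works, but drops that level.
via_loc = df.loc[:, "condition1"]
```

All three select the same six columns; they differ only in whether the condition level is kept in the result.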

My question might seem a bit silly, but I'd like to see if there is any difference in convenience between multiindexing rows vs. columns for dataframes (such as the query case above).

Thanks.

Lei · Accepted Answer

A rough personal summary of what I call the row/column propensity of some common DataFrame operations:

[]: column-first
get: column-only
attribute accessing as indexing: column-only
query: row-only
loc, iloc, ix: row-first (ix has since been removed from pandas)
xs: row-first
sortlevel: row-first (since replaced by sort_index(level=...))
groupby: row-first

"row-first" means the operation takes the row index as its first argument; to operate on the column index one needs to use [:, ...] or specify axis=1.
"row-only" means the operation works only on the row index; to operate on the column index one has to do something like transpose the DataFrame first.
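
A small sketch of a few of these propensities on a toy frame (the labels a/b/c and x/y are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

# Column-first / column-only operations
col_brackets = df["x"]        # [] looks the label up in the columns
col_get = df.get("x")         # get also looks in the columns
row_get = df.get("a")         # a row label is not found, so the default (None) is returned
col_attr = df.x               # attribute access resolves column names

# Row-first operations: rows by default, columns via [:, ...] or axis=1
row_loc = df.loc["a"]         # row "a"
col_loc = df.loc[:, "x"]      # column "x"
row_xs = df.xs("a")           # row "a"
col_xs = df.xs("x", axis=1)   # column "x"
```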

Based on this, it seems multiindexing rows is slightly more convenient.
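
If the frame already has its MultiIndex on the columns, one possible way to get the row-side convenience is to move some levels into the row index with stack. This is only a sketch on a frame shaped like the one in the question (np.arange data is my substitution for the random values), not a recommendation for every layout:

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["condition1", "condition2"],
     ["patient1", "patient2"],
     ["measure1", "measure2", "measure3"]],
    names=["condition", "patient", "measure"],
)
df = pd.DataFrame(
    np.arange(4 * len(cols)).reshape(4, len(cols)),
    index=pd.Index([0, 1, 2, 3], name="time"),
    columns=cols,
)

# Move the condition and patient levels from the columns into the row index;
# the measures stay as ordinary columns.
long = df.stack(["condition", "patient"])

# query can now refer to the level names without any transposing.
subset = long.query('condition == "condition1" and patient == "patient2"')
```

Here long has one row per (time, condition, patient) and one column per measure, so subset holds the four time points for a single patient under a single condition.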

A natural follow-up question: why don't the pandas developers unify the row/column propensity of DataFrame operations? For example, [] and loc/iloc/ix are the most common ways of indexing a DataFrame, yet the former slices columns while the latter slice rows, which seems a bit odd.

Tags: python, pandas, numpy, multi-index