
When to use multiindexing vs. xarray in pandas

The pandas pivot tables documentation seems to recommend handling more than two dimensions of data with multiindexing:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import pandas.util.testing as tm; tm.N = 3

In [4]: def unpivot(frame):
   ...:         N, K = frame.shape
   ...:         data = {'value' : frame.values.ravel('F'),
   ...:                 'variable' : np.asarray(frame.columns).repeat(N),
   ...:                 'date' : np.tile(np.asarray(frame.index), K)}
   ...:         return pd.DataFrame(data, columns=['date', 'variable', 'value'])
   ...: 

In [5]: df = unpivot(tm.makeTimeDataFrame())

In [6]: df['value2'] = df['value'] * 2

In [7]: df
Out[7]: 
         date variable     value    value2
0  2000-01-03        A  0.462461  0.924921
1  2000-01-04        A -0.517911 -1.035823
2  2000-01-05        A  0.831014  1.662027
3  2000-01-03        B -0.492679 -0.985358
4  2000-01-04        B -1.234068 -2.468135
5  2000-01-05        B  1.725218  3.450437
6  2000-01-03        C  0.453859  0.907718
7  2000-01-04        C -0.763706 -1.527412
8  2000-01-05        C  0.839706  1.679413
9  2000-01-03        D -0.048108 -0.096216
10 2000-01-04        D  0.184461  0.368922
11 2000-01-05        D -0.349496 -0.698993

In [8]: df.pivot(index='date', columns='variable')
Out[8]: 
               value                                  value2            \
variable           A         B         C         D         A         B   
date                                                                     
2000-01-03  0.462461 -0.492679  0.453859 -0.048108  0.924921 -0.985358   
2000-01-04 -0.517911 -1.234068 -0.763706  0.184461 -1.035823 -2.468135   
2000-01-05  0.831014  1.725218  0.839706 -0.349496  1.662027  3.450437   


variable           C         D  
date                            
2000-01-03  0.907718 -0.096216  
2000-01-04 -1.527412  0.368922  
2000-01-05  1.679413 -0.698993  

I thought that xarray was made for handling multidimensional datasets like this:

In [9]: import xarray as xr

In [10]: xr.DataArray(dict([(var, df[df.variable==var].drop('variable', 1)) for var in np.unique(df.variable)]))
Out[10]: 
<xarray.DataArray ()>
array({'A':         date     value    value2
0 2000-01-03  0.462461  0.924921
1 2000-01-04 -0.517911 -1.035823
2 2000-01-05  0.831014  1.662027, 'C':         date     value    value2
6 2000-01-03  0.453859  0.907718
7 2000-01-04 -0.763706 -1.527412
8 2000-01-05  0.839706  1.679413, 'B':         date     value    value2
3 2000-01-03 -0.492679 -0.985358
4 2000-01-04 -1.234068 -2.468135
5 2000-01-05  1.725218  3.450437, 'D':          date     value    value2
9  2000-01-03 -0.048108 -0.096216
10 2000-01-04  0.184461  0.368922
11 2000-01-05 -0.349496 -0.698993}, dtype=object)

Is one of these approaches better than the other? Why hasn't xarray completely replaced multiindexing?

kilojoules asked Mar 18 '17 15:03



1 Answer

There does seem to be a transition to xarray for work on multi-dimensional arrays. Pandas is deprecating support for its 3D Panel data structure, and its documentation even suggests using xarray for working with multidimensional arrays:

'Oftentimes, one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.

In addition, the xarray package was built from the ground up, specifically in order to support the multi-dimensional analysis that is one of Panel's main use cases. Here is a link to the xarray panel-transition documentation.'
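To make the MultiIndex route concrete, here is a minimal sketch of treating a MultiIndex DataFrame as a small 3-D data set. The data is made up for illustration (a hypothetical 3 dates by 4 variables, loosely mirroring the layout in the question); it is not the output from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 3 dates x 4 variables, with two value columns.
dates = pd.date_range("2000-01-03", periods=3, freq="D")
variables = ["A", "B", "C", "D"]
index = pd.MultiIndex.from_product(
    [dates, variables], names=["date", "variable"]
)

values = np.arange(12, dtype=float)
frame = pd.DataFrame({"value": values, "value2": values * 2}, index=index)

# Take one slice along the "variable" level; the result is indexed by date only.
b_only = frame.xs("B", level="variable")

# Reduce over the "date" level by grouping on the remaining level.
mean_by_variable = frame.groupby(level="variable")["value"].mean()

# Move one level into the columns to get the wide, pivot-style layout.
wide = frame["value"].unstack("variable")
```

Each index level plays the role of a dimension here, but selection and reduction go through level-specific methods such as xs, groupby and unstack rather than one uniform interface.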

The xarray documentation states its aims and goals:

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data...

...Our target audience is anyone who needs N-dimensional labelled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF

The main advantage of xarray over plain numpy is that it uses labels the way pandas does, but across multiple dimensions. If you are working with 3-dimensional data, multi-indexing and xarray may be interchangeable. As the number of dimensions in your data set grows, xarray becomes much more manageable. I cannot comment on how the two compare in efficiency or speed.
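As a rough illustration of how the two representations relate, the sketch below converts a MultiIndex DataFrame to an xarray Dataset and back. The data is hypothetical (3 dates by 4 variables), and the snippet assumes reasonably recent pandas and xarray versions are installed.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: 3 dates x 4 variables, with two value columns.
dates = pd.date_range("2000-01-03", periods=3, freq="D")
variables = ["A", "B", "C", "D"]
index = pd.MultiIndex.from_product(
    [dates, variables], names=["date", "variable"]
)
values = np.arange(12, dtype=float)
frame = pd.DataFrame({"value": values, "value2": values * 2}, index=index)

# Each index level becomes a named dimension; each column becomes a data variable.
ds = frame.to_xarray()

# Selection and reduction refer to dimensions by name rather than axis number.
b_only = ds["value"].sel(variable="B")
mean_over_dates = ds["value"].mean(dim="date")

# Convert back to a MultiIndex DataFrame when a tabular view is more convenient.
round_trip = ds.to_dataframe()
```

Since the conversion works in both directions, choosing one does not lock you in: you can keep tabular work in pandas and switch to xarray when you want operations expressed by dimension name.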

Tkanno answered Sep 17 '22 19:09