When to use multiindexing vs. xarray in pandas

Tags:

The pandas pivot tables documentation seems to recomend dealing with more than two dimensions of data by using multiindexing:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import pandas.util.testing as tm; tm.N = 3

In [4]: def unpivot(frame):
   ...:         N, K = frame.shape
   ...:         data = {'value' : frame.values.ravel('F'),
   ...:                 'variable' : np.asarray(frame.columns).repeat(N),
   ...:                 'date' : np.tile(np.asarray(frame.index), K)}
   ...:         return pd.DataFrame(data, columns=['date', 'variable', 'value'])
   ...: 

In [5]: df = unpivot(tm.makeTimeDataFrame())

In [6]: df
Out[6]: 
         date variable     value    value2
0  2000-01-03        A  0.462461  0.924921
1  2000-01-04        A -0.517911 -1.035823
2  2000-01-05        A  0.831014  1.662027
3  2000-01-03        B -0.492679 -0.985358
4  2000-01-04        B -1.234068 -2.468135
5  2000-01-05        B  1.725218  3.450437
6  2000-01-03        C  0.453859  0.907718
7  2000-01-04        C -0.763706 -1.527412
8  2000-01-05        C  0.839706  1.679413
9  2000-01-03        D -0.048108 -0.096216
10 2000-01-04        D  0.184461  0.368922
11 2000-01-05        D -0.349496 -0.698993

In [7]: df['value2'] = df['value'] * 2

In [8]: df.pivot('date', 'variable')
Out[8]: 
               value                                  value2            \
variable           A         B         C         D         A         B   
date                                                                     
2000-01-03 -1.558856 -1.144732 -0.234630 -1.252482 -3.117712 -2.289463   
2000-01-04 -1.351152 -0.173595  0.470253 -1.181006 -2.702304 -0.347191   
2000-01-05  0.151067 -0.402517 -2.625085  1.275430  0.302135 -0.805035   


variable           C         D  
date                            
2000-01-03 -0.469259 -2.504964  
2000-01-04  0.940506 -2.362012  
2000-01-05 -5.250171  2.550861

I thought that xarray was made for handling multidimensional datasets like this:

In [9]: import xarray as xr

In [10]: xr.DataArray(dict([(var, df[df.variable==var].drop('variable', 1)) for var in np.unique(df.variable)]))
Out[10]: 
<xarray.DataArray ()>
array({'A':         date     value    value2
0 2000-01-03  0.462461  0.924921
1 2000-01-04 -0.517911 -1.035823
2 2000-01-05  0.831014  1.662027, 'C':         date     value    value2
6 2000-01-03  0.453859  0.907718
7 2000-01-04 -0.763706 -1.527412
8 2000-01-05  0.839706  1.679413, 'B':         date     value    value2
3 2000-01-03 -0.492679 -0.985358
4 2000-01-04 -1.234068 -2.468135
5 2000-01-05  1.725218  3.450437, 'D':          date     value    value2
9  2000-01-03 -0.048108 -0.096216
10 2000-01-04  0.184461  0.368922
11 2000-01-05 -0.349496 -0.698993}, dtype=object)

Is one of these approaches better than the other? Why hasn't xarray completely replaced multiindexing?

504

asked Mar 18 '17 15:03

kilojoules

1 Answers

There does seem to be a transition to xarray for doing work on multi-dimensional arrays. Pandas will be depreciating the support for the 3D Panels data structure and in the documentation even suggest using xarray for working with multidemensional arrays:

'Oftentimes, one can simply use a MultiIndex DataFrame for easily working with higher dimensional data.

In addition, the xarray package was built from the ground up, specifically in order to support the multi-dimensional analysis that is one of Panel s main use cases. Here is a link to the xarray panel-transition documentation.'

From the xarray documentation they state their aims and goals:

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data...

...Our target audience is anyone who needs N-dimensional labelled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF

The main advantage of xarray over using straight numpy is that it makes use of labels in the same way pandas does over multiple dimensions. If you are working with 3-dimensional data using multi-indexing or xarray might be interchangeable. As the number of dimensions grows in your data set xarray becomes much more manageable. I cannot comment on how each performs in terms of efficiency or speed.

answered Sep 17 '22 19:09

Tkanno

Related questions
                            
                                wtforms hidden field value
                            
                                Is it possible to np.concatenate memory-mapped files?
                            
                                Python: Get unbound class method
                            
                                Trigger an event when clipboard content changes
                            
                                How do you improve matplotlib image quality?
                            
                                scipy ImportError on travis-ci
                            
                                Why am I getting a FileNotFoundError?
                            
                                Wrapping arrays in Boost Python
                            
                                IPython Notebook Tab-Complete -- Show Docstring
                            
                                Do Python strings end in a terminating NULL?
                            
                                Why does str(KeyError) add extra quotes?
                            
                                Pandas Equivalent of R's which()
                            
                                How to count top 10 most common values in a dict in python
                            
                                geodesic distance transform in python
                            
                                How does this Python 3 quine work?
                            
                                Python: List of lists to dictionary [closed]
                            
                                Flask app get "IOError: [Errno 32] Broken pipe"
                            
                                Why are mutable values in Python Enums the same object?
                            
                                Tensorflow Queues - Switching between train and validation data
                            
                                Equivalent of copyTo in Python OpenCV bindings?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

When to use multiindexing vs. xarray in pandas

Tags:

python

pandas

data-structures

multi-index

xarray

kilojoules

People also ask

1 Answers

Tkanno

Recent Activity

Donate For Us