High-dimensional data structure in Python

What is the best way to store and analyze high-dimensional data in Python? I like the pandas DataFrame and Panel, where I can easily manipulate the axes. Now I have a hypercube (dim >= 4) of data. I have been considering things like a dict of Panels, or tuples as Panel entries. Is there a higher-dimensional Panel-like structure in Python?

update 20/05/16: Thanks very much for all the answers. I have tried MultiIndex and xarray, but I am not yet able to comment on either of them. For my problem I will try a plain ndarray instead, as I found the labels are not essential and I can save them separately.

update 16/09/16: I ended up using MultiIndex. Manipulating it is pretty tricky at first, but I have gotten used to it.

Wang asked May 18 '16 23:05


3 Answers

MultiIndex is the most useful option for higher-dimensional data, as explained in the docs and this SO answer, because it allows you to work with any number of dimensions in a DataFrame environment.
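As a rough sketch of what "any number of dimensions" can look like in practice, the snippet below flattens a 4-D NumPy array into a Series with one index level per axis. The axis names, labels, and values are made up for illustration; only `pd.MultiIndex.from_product`, `Series.loc`, and `Series.unstack` are actual pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical 4-D cube: 2 years x 3 regions x 2 products x 4 quarters.
shape = (2, 3, 2, 4)
cube = np.arange(np.prod(shape)).reshape(shape)

# One index level per axis; names and labels are invented for this example.
index = pd.MultiIndex.from_product(
    [[2015, 2016], ["north", "south", "west"], ["x", "y"], ["Q1", "Q2", "Q3", "Q4"]],
    names=["year", "region", "product", "quarter"],
)

# Flattening in C order matches the row order produced by from_product.
s = pd.Series(cube.ravel(), index=index, name="value")

# Select a single cell by its four labels.
cell = s.loc[(2016, "south", "y", "Q3")]

# Move one level into the columns to get a 2-D view of the same data.
wide = s.unstack("quarter")
```

Which levels go in the rows and which in the columns is a matter of convenience; `unstack` and `stack` move levels between the two without changing the data.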

In addition to Panel, there is also Panel4D, currently in an experimental stage. Given the advantages of MultiIndex, I wouldn't recommend using either this or the three-dimensional version. I don't think these data structures have gained much traction in comparison, and they will indeed be phased out.

Stefan answered Sep 20 '22 01:09


If you need labelled arrays and pandas-like smart indexing, you can use the xarray package, which is essentially an n-dimensional extension of the pandas Panel (Panels are being deprecated in pandas in favour of xarray).
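A minimal sketch of the xarray approach, assuming xarray is installed. The dimension names, coordinate labels, and data are hypothetical; `xr.DataArray`, `.sel`, and `.sum(dim=...)` are the real calls being illustrated.

```python
import numpy as np
import xarray as xr

# Hypothetical 4-D cube; dimension names and labels are invented.
data = np.arange(48).reshape(2, 3, 2, 4)
da = xr.DataArray(
    data,
    dims=["year", "region", "product", "quarter"],
    coords={
        "year": [2015, 2016],
        "region": ["north", "south", "west"],
        "product": ["x", "y"],
        "quarter": ["Q1", "Q2", "Q3", "Q4"],
    },
)

# Label-based selection, similar in spirit to pandas .loc.
cell = da.sel(year=2016, region="south", product="y", quarter="Q3").item()

# Reduce over named dimensions instead of positional axes.
by_year = da.sum(dim=["region", "product", "quarter"])
```

The main draw here is that every axis carries a name, so selections and reductions do not depend on remembering axis positions.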

Otherwise, it may sometimes be reasonable to use plain numpy arrays which can be of any dimensionality; you can also have arbitrarily nested numpy record arrays of any dimension.
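If the labels are kept separately, as the asker mentions in the update, something along these lines might be enough. This is only a sketch: the `labels` dict and the `locate` helper are hypothetical names introduced here, not part of NumPy.

```python
import numpy as np

# Hypothetical axis labels, stored outside the array.
labels = {
    "year": [2015, 2016],
    "region": ["north", "south", "west"],
    "product": ["x", "y"],
    "quarter": ["Q1", "Q2", "Q3", "Q4"],
}
axes = list(labels)  # axis order: year, region, product, quarter

cube = np.arange(48, dtype=float).reshape([len(labels[a]) for a in axes])


def locate(**wanted):
    """Translate labels into a positional index; unspecified axes are kept whole."""
    return tuple(
        labels[a].index(wanted[a]) if a in wanted else slice(None)
        for a in axes
    )


# A single cell, a 3-D slab for one region, and a mean over the quarter axis.
cell = cube[locate(year=2016, region="south", product="y", quarter="Q3")]
south = cube[locate(region="south")]
quarterly_mean = cube.mean(axis=axes.index("quarter"))
```

This keeps the array itself as fast and simple as possible, at the cost of doing the label bookkeeping by hand.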

aldanor answered Sep 18 '22 01:09


I recommend continuing to use DataFrame but utilizing the MultiIndex feature. DataFrame is better supported, and you preserve all of your dimensionality with the MultiIndex.

Example

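import pandas as pd

# Base 2-D frame: rows A, B and columns a, b
df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['A', 'B'])

# Third dimension: stack copies along the rows under an outer row key
df3 = pd.concat([df for _ in [0, 1]], keys=['one', 'two'])

# Fourth dimension: place copies side by side under an outer column key
df4 = pd.concat([df3 for _ in [0, 1]], axis=1, keys=['One', 'Two'])

print(df4)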

Looks like:

      One    Two   
        a  b   a  b
one A   1  2   1  2
    B   3  4   3  4
two A   1  2   1  2
    B   3  4   3  4

This is a hypercube of data. You'll also be much better served by staying with DataFrame: better support, more answered questions, fewer bugs, and many other benefits.
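To give an idea of how such a frame can be sliced, here is a small sketch that rebuilds the same `df4` and pulls out a cell and two cross-sections. The variable names `cell`, `slab`, `col_a`, and `long` are just for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"], index=["A", "B"])
df3 = pd.concat([df, df], keys=["one", "two"])
df4 = pd.concat([df3, df3], axis=1, keys=["One", "Two"])

# A single cell: (outer row, inner row), (outer column, inner column).
cell = df4.loc[("two", "B"), ("One", "a")]

# Fix one coordinate with xs; the remaining levels stay in place.
slab = df4.xs("one", level=0)          # rows A, B with all four columns
col_a = df4.xs("a", axis=1, level=1)   # all four rows with columns One, Two

# Stack both column levels into the index to get one long 4-level Series.
long = df4.stack([0, 1])
```

Depending on the pandas version, `stack` may emit a FutureWarning about its new implementation; the result for this frame should be the same either way since there are no missing values.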

piRSquared answered Sep 18 '22 01:09