 

Storing multidimensional arrays in pandas DataFrame columns

Tags: python, pandas

I'm hoping to use a pandas DataFrame as the main Trace object (a series of points in parameter space from MCMC).

I have a list of dicts of string->array which I would like to store in pandas. The keys in the dicts are always the same, and for each key the shape of the numpy array is always the same, but the shape may be different for different keys and could have a different number of dimensions.

I had been using self.append(dict_list, ignore_index=True), which seems to work well for 1-d values, but for values with more than one dimension pandas stores them as objects, which doesn't allow for nice plotting and other nice things. Any suggestions on how to get better behavior?

Sample data

point = {'x': array(-0.47652306228698005),
         'y': array([[-0.41809043],
                     [ 0.48407823]])}

points = 10 * [ point]

I'd like to be able to do something like

df = DataFrame(points)

or

df = DataFrame()
df.append(points, ignore_index=True)

and have

>>> df['x'][1].shape
()
>>> df['y'][1].shape
(2, 1)
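
For reference, the object-dtype behavior described above can be reproduced with a minimal sketch along these lines (the exact column handling may differ slightly between pandas versions):

import numpy as np
import pandas as pd

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

df = pd.DataFrame(points)
print(df['x'][1].shape)   # () -- the scalar values round-trip fine
print(df['y'][1].shape)   # (2, 1) -- each array is kept as a single cell...
print(df['y'].dtype)      # object -- ...so the column dtype is object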
asked Apr 04 '13 by John Salvatier



3 Answers

The relatively new library xray [1] has Dataset and DataArray structures that do exactly what you ask.

Here is my take on your problem, written as an IPython session:

>>> import numpy as np
>>> import xray

>>> ## Prepare data:
>>> #
>>> point = {'x': np.array(-0.47652306228698005),
...          'y': np.array([[-0.41809043],
...                      [ 0.48407823]])}
>>> points = 10 * [point]

>>> ## Convert to Xray DataArrays:
>>> #
>>> list_x = [p['x'] for p in points]
>>> list_y = [p['y'] for p in points]
>>> da_x = xray.DataArray(list_x, [('x', range(len(list_x)))])
>>> da_y = xray.DataArray(list_y, [
...     ('x', range(len(list_y))),
...     ('y0', range(2)), 
...     ('y1', [0]), 
... ])

These are the two DataArray instances we built so far:

>>> print(da_x)
<xray.DataArray (x: 10)>
array([-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306,
       -0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9


>>> print(da_y.T) ## Transposed, to save lines.
<xray.DataArray (y1: 1, y0: 2, x: 10)>
array([[[-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043,
         -0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043],
        [ 0.48407823,  0.48407823,  0.48407823,  0.48407823,  0.48407823,
          0.48407823,  0.48407823,  0.48407823,  0.48407823,  0.48407823]]])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y0       (y0) int32 0 1
  * y1       (y1) int32 0

We can now merge these two DataArrays on their common x dimension into a Dataset:

>>> ds = xray.Dataset({'X':da_x, 'Y':da_y})
>>> print(ds)
<xray.Dataset>
Dimensions:  (x: 10, y0: 2, y1: 1)
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y0       (y0) int32 0 1
  * y1       (y1) int32 0
Data variables:
    X        (x) float64 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 ...
    Y        (x, y0, y1) float64 -0.4181 0.4841 -0.4181 0.4841 -0.4181 0.4841 -0.4181 ...

And we can finally access and aggregate data the way you wanted:

>>> ds['X'].sum()
<xray.DataArray 'X' ()>
array(-4.765230622869801)


>>> ds['Y'].sum()
<xray.DataArray 'Y' ()>
array(0.659878)


>>> ds['Y'].sum(axis=1)
<xray.DataArray 'Y' (x: 10, y1: 1)>
array([[ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878]])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y1       (y1) int32 0

>>> np.all(ds['Y'].sum(axis=1) == ds['Y'].sum(dim='y0'))
True

>>> ds['X'].sum(dim='y0')
Traceback (most recent call last):
ValueError: 'y0' not found in array dimensions ('x',)

[1] A library for handling N-dimensional data with labels, like pandas does for 2D: http://xray.readthedocs.org/en/stable/data-structures.html#dataset
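
As a side note, the xray project was later renamed xarray; a roughly equivalent sketch using the modern package would look like this (same data as above):

import numpy as np
import xarray as xr

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

# Stack the per-point arrays along a new leading 'x' dimension.
da_x = xr.DataArray([p['x'] for p in points], dims=['x'])
da_y = xr.DataArray([p['y'] for p in points], dims=['x', 'y0', 'y1'])

ds = xr.Dataset({'X': da_x, 'Y': da_y})
print(ds['Y'].sum(dim='y0'))  # sums over y0, keeping the x and y1 dimensions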

answered Oct 17 '22 by ankostis

Combining @Eike's answer and @JohnSalvatier's comment seems pretty Pandasonic:

>>> import pandas as pd
>>> import numpy as np
>>> point = {'x': np.array(-0.47652306228698005),
...          'y': np.array([[-0.41809043],
...                         [ 0.48407823]])}
>>> points = 10 * [point]  # this creates a list of 10 point dicts
>>> df = pd.DataFrame().append(points)
>>> df.x
# 0    -0.476523062287
#   ...
# 9    -0.476523062287
# Name: x, dtype: object
>>> df.y
# 0    [[-0.41809043], [0.48407823]]
#   ...
# 9    [[-0.41809043], [0.48407823]]
# Name: y, dtype: object
>>> df.y[0]
# array([[-0.41809043],
#        [ 0.48407823]])
>>> df.y[0].shape
# (2, 1)

To plot (and do all the other cool 2-D Pandas things) you still have to manually convert the column of arrays back to a DataFrame:

>>> dfy = pd.DataFrame([row.T[0] for row in df.y])
>>> dfy += np.array([[0] * 10, range(10)]).T   # arbitrary offsets so the identical rows diverge
>>> dfy *= np.array([range(10), range(10)]).T  # arbitrary scaling, just to make the plot interesting
>>> dfy.plot()

example 2-D plot

To store this on disk, use to_pickle:

>>> df.to_pickle('/tmp/sotest.pickle')
>>> df2 = pd.read_pickle('/tmp/sotest.pickle')
>>> df2.y[0].shape
# (2, 1)

If you use to_csv, your np.arrays become strings:

>>> df.to_csv('/tmp/sotest.csv')
>>> df2 = pd.read_csv('/tmp/sotest.csv', index_col=0)
>>> df2.y[0]
# '[[-0.41809043]\n [ 0.48407823]]'
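
One possible workaround (just a sketch; the y0/y1 column names are invented here) is to flatten each array cell into scalar columns before writing, so the values survive a CSV round trip:

>>> flat = pd.DataFrame({
...     'x': [float(v) for v in df.x],
...     'y0': [v[0, 0] for v in df.y],
...     'y1': [v[1, 0] for v in df.y],
... })
>>> flat.to_csv('/tmp/sotest_flat.csv', index=False)
>>> flat2 = pd.read_csv('/tmp/sotest_flat.csv')
>>> flat2.y0[0]
# -0.41809043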
answered Oct 17 '22 by hobs

This goes a bit against pandas' philosophy, which treats a Series as a one-dimensional data structure. You therefore have to create the Series by hand and tell them they have dtype "object", which means pandas won't apply any automatic data conversions.

You can do it like this (reordered Ipython session):

In [3]: import numpy

In [4]: from numpy import array

In [9]: import pandas as pd

In [1]: point = {'x': array(-0.47652306228698005),
   ...:          'y': array([[-0.41809043],
   ...:                      [ 0.48407823]])}

In [2]: points = 10 * [ point]

In [5]: lx = [p["x"] for p in points]

In [7]: ly = [p["y"] for p in points]

In [40]: sx = pd.Series(lx, dtype=numpy.dtype("object"))

In [38]: sy = pd.Series(ly, dtype=numpy.dtype("object"))

In [43]: df = pd.DataFrame({"x":sx, "y":sy})

In [45]: df['x'][1].shape
Out[45]: ()

In [46]: df['y'][1].shape
Out[46]: (2, 1)
answered Oct 17 '22 by Eike