Storing multidimensional arrays in pandas DataFrame columns

Tags:

pandas

I'm hoping to use pandas as the main Trace (series of points in parameter space from MCMC) object.

I have a list of dicts of string->array which I would like to store in pandas. The keys in the dicts are always the same, and for each key the shape of the numpy array is always the same, but the shape may be different for different keys and could have a different number of dimensions.

I had been using self.append(dict_list, ignore_index=True), which seems to work well for 1-d values, but for values with more than one dimension pandas stores them as objects, which rules out nice plotting and other conveniences. Any suggestions on how to get better behavior?

Sample data

point = {'x': array(-0.47652306228698005),
         'y': array([[-0.41809043],
                     [ 0.48407823]])}

points = 10 * [ point]

I'd like to be able to do something like

df = DataFrame(points)

or

df = DataFrame()
df.append(points, ignore_index=True)

and have

>> df['x'][1].shape
()
>> df['y'][1].shape 
(2,1)


asked Apr 04 '13 08:04

John Salvatier

3 Answers

The relatively new library xray[1] (since renamed xarray) has Dataset and DataArray structures that do exactly what you ask.

Here is my take on your problem, written as an IPython session:

>>> import numpy as np
>>> import xray

>>> ## Prepare data:
>>> #
>>> point = {'x': np.array(-0.47652306228698005),
...          'y': np.array([[-0.41809043],
...                      [ 0.48407823]])}
>>> points = 10 * [point]

>>> ## Convert to Xray DataArrays:
>>> #
>>> list_x = [p['x'] for p in points]
>>> list_y = [p['y'] for p in points]
>>> da_x = xray.DataArray(list_x, [('x', range(len(list_x)))])
>>> da_y = xray.DataArray(list_y, [
...     ('x', range(len(list_y))),
...     ('y0', range(2)), 
...     ('y1', [0]), 
... ])

These are the two DataArray instances we built so far:

>>> print(da_x)
<xray.DataArray (x: 10)>
array([-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306,
       -0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9


>>> print(da_y.T) ## Transposed, to save lines.
<xray.DataArray (y1: 1, y0: 2, x: 10)>
array([[[-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043,
         -0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043],
        [ 0.48407823,  0.48407823,  0.48407823,  0.48407823,  0.48407823,
          0.48407823,  0.48407823,  0.48407823,  0.48407823,  0.48407823]]])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y0       (y0) int32 0 1
  * y1       (y1) int32 0

We can now merge these two DataArrays on their common x dimension into a Dataset:

>>> ds = xray.Dataset({'X':da_x, 'Y':da_y})
>>> print(ds)
<xray.Dataset>
Dimensions:  (x: 10, y0: 2, y1: 1)
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y0       (y0) int32 0 1
  * y1       (y1) int32 0
Data variables:
    X        (x) float64 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 ...
    Y        (x, y0, y1) float64 -0.4181 0.4841 -0.4181 0.4841 -0.4181 0.4841 -0.4181 ...

And we can finally access and aggregate data the way you wanted:

>>> ds['X'].sum()
<xray.DataArray 'X' ()>
array(-4.765230622869801)


>>> ds['Y'].sum()
<xray.DataArray 'Y' ()>
array(0.659878)


>>> ds['Y'].sum(axis=1)
<xray.DataArray 'Y' (x: 10, y1: 1)>
array([[ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878],
       [ 0.0659878]])
Coordinates:
  * x        (x) int32 0 1 2 3 4 5 6 7 8 9
  * y1       (y1) int32 0

>>> np.all(ds['Y'].sum(axis=1) == ds['Y'].sum(dim='y0'))
True

>>> ds['X'].sum(dim='y0')
Traceback (most recent call last):
ValueError: 'y0' not found in array dimensions ('x',)

[1] A library for handling N-dimensional data with labels, like pandas does for 2D: http://xray.readthedocs.org/en/stable/data-structures.html#dataset
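For readers who want to check the numbers without the labelled structures, here is a minimal NumPy-only sketch of the same reductions. It assumes the sample data from the question; the variable names (y, total, per_point) are mine and not part of the answer above. The positional axis 1 plays the role of the y0 dimension.

```python
import numpy as np

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [ 0.48407823]])}
points = 10 * [point]

# Stack the per-point arrays along a new leading axis.
# Resulting axes correspond to (x, y0, y1), shape (10, 2, 1).
y = np.stack([p['y'] for p in points])

total = y.sum()            # sum over every element
per_point = y.sum(axis=1)  # collapse the y0 axis, leaving shape (10, 1)
```

What the labelled version adds is that the axis can be named (dim='y0') instead of remembered by position.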

answered Oct 17 '22 12:10

ankostis

Combining @Eike's answer and @JohnSalvatier's comment seems pretty Pandasonic:

>>> import pandas as pd
>>> import numpy as np
>>> point = {'x': np.array(-0.47652306228698005),
...          'y': np.array([[-0.41809043],
...                         [ 0.48407823]])}
>>> points = 10 * [point]  # this creates a list of 10 point dicts
>>> df = pd.DataFrame().append(points)  # pandas < 2.0 only; DataFrame.append was removed in 2.0
>>> df.x
# 0    -0.476523062287
#   ...
# 9    -0.476523062287
# Name: x, dtype: object
>>> df.y
# 0    [[-0.41809043], [0.48407823]]
#   ...
# 9    [[-0.41809043], [0.48407823]]
# Name: y, dtype: object
>>> df.y[0]
# array([[-0.41809043],
#        [ 0.48407823]])
>>> df.y[0].shape
# (2, 1)

To plot (and do all the other cool 2-D Pandas things) you still have to manually convert the column of arrays back to a DataFrame:

>>> dfy = pd.DataFrame([row.T[0] for row in df.y])
>>> dfy += np.matrix([[0] * 10, range(10)]).T
>>> dfy *= np.matrix([range(10), range(10)]).T
>>> dfy.plot()

[Figure: example 2-D plot]
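The manual conversion above only handles this particular (2, 1) shape. A more general sketch, assuming every array under a key has the same shape (as the question states), is to stack the arrays and flatten everything except the sample axis. The column naming scheme below is my own invention for illustration.

```python
import numpy as np
import pandas as pd

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [ 0.48407823]])}
points = 10 * [point]

# Stack the per-sample arrays along a new leading axis: shape (10, 2, 1).
y_stacked = np.stack([p['y'] for p in points])

# Flatten every axis except the sample axis: shape (10, 2).
y_flat = y_stacked.reshape(len(points), -1)

# Hypothetical column names built from the original array indices.
columns = ['y[%d,%d]' % idx for idx in np.ndindex(*y_stacked.shape[1:])]

# A plain float DataFrame, one column per array element.
dfy = pd.DataFrame(y_flat, columns=columns)
```

The result has a numeric dtype rather than object, so dfy.plot() and the usual aggregations apply directly. The format string assumes 2-d arrays; other dimensionalities would need a different naming rule.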

To store this on disk, use to_pickle:

>>> df.to_pickle('/tmp/sotest.pickle')
>>> df2 = pd.read_pickle('/tmp/sotest.pickle')
>>> df2.y[0].shape
# (2, 1)

If you use to_csv your np.arrays become strings:

>>> df.to_csv('/tmp/sotest.csv')
>>> df2 = pd.read_csv('/tmp/sotest.csv', index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
>>> df2.y[0]
# '[[-0.41809043]\n [ 0.48407823]]'
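If pickle is not an option, one alternative that keeps the arrays numeric is NumPy's own .npz format. This is a different technique from the pandas calls above, shown only as a sketch: it assumes each key's arrays share a shape, and it uses an in-memory buffer as a stand-in for a file on disk.

```python
import io

import numpy as np

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [ 0.48407823]])}
points = 10 * [point]

# One stacked array per key; the leading axis indexes the samples.
stacked = {key: np.stack([p[key] for p in points]) for key in point}

buffer = io.BytesIO()        # stand-in for a real file path
np.savez(buffer, **stacked)  # writes one named array per key
buffer.seek(0)

# Read everything back into a plain dict of arrays.
with np.load(buffer) as archive:
    restored = {key: archive[key] for key in archive.files}
```

Here restored['y'][0] is again a (2, 1) float array rather than a string.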

answered Oct 17 '22 12:10

hobs

It goes a bit against Pandas' philosophy, which seems to treat a Series as a one-dimensional data structure. You therefore have to create each Series by hand and tell it that its data type is "object", which means pandas does not apply any automatic data conversions.

You can do it like this (reordered IPython session):

In [9]: import pandas as pd

In [10]: import numpy; from numpy import array  # names the session below relies on

In [1]: point = {'x': array(-0.47652306228698005),
   ...:          'y': array([[-0.41809043],
   ...:                      [ 0.48407823]])}

In [2]: points = 10 * [ point]

In [5]: lx = [p["x"] for p in points]

In [7]: ly = [p["y"] for p in points]

In [40]: sx = pd.Series(lx, dtype=numpy.dtype("object"))

In [38]: sy = pd.Series(ly, dtype=numpy.dtype("object"))

In [43]: df = pd.DataFrame({"x":sx, "y":sy})

In [45]: df['x'][1].shape
Out[45]: ()

In [46]: df['y'][1].shape
Out[46]: (2, 1)
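The same idea can be wrapped up so it works for every key at once. This is only a sketch: object_series is a hypothetical helper of mine, not a pandas function, and it fills an object array cell by cell so that neither NumPy nor pandas gets a chance to merge the per-row arrays into one larger array.

```python
import numpy as np
import pandas as pd

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [ 0.48407823]])}
points = 10 * [point]


def object_series(values):
    """Hypothetical helper: put each value in its own cell of an object Series."""
    cells = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        cells[i] = value  # element-wise assignment keeps each array intact
    return pd.Series(cells)


# One object column per key of the point dicts.
df = pd.DataFrame({key: object_series([p[key] for p in points])
                   for key in point})

# np.shape works for arrays and scalars alike.
shapes = {key: np.shape(df[key][1]) for key in df.columns}
```

The columns are still dtype object, so this gives the shapes the question asks for but not the plotting conveniences.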

answered Oct 17 '22 10:10

Eike
