I'm hoping to use pandas as the main Trace (series of points in parameter space from MCMC) object.
I have a list of dicts of string->array which I would like to store in pandas. The keys in the dicts are always the same, and for each key the shape of the numpy array is always the same, but the shape may be different for different keys and could have a different number of dimensions.
I had been using self.append(dict_list, ignore_index = True), which seems to work well for 1d values, but for nd>1 values pandas stores the values as objects, which doesn't allow for nice plotting and other nice things. Any suggestions on how to get better behavior?
Sample data
point = {'x': array(-0.47652306228698005),
'y': array([[-0.41809043],
[ 0.48407823]])}
points = 10 * [ point]
I'd like to be able to do something like
df = DataFrame(points)
or
df = DataFrame()
df.append(points, ignore_index=True)
and have
>> df['x'][1].shape
()
>> df['y'][1].shape
(2,1)
The relatively new library xray[1] (since renamed xarray) has Dataset and DataArray structures that do exactly what you ask.
Here is my take on your problem, written as an IPython session:
>>> import numpy as np
>>> import xray
>>> ## Prepare data:
>>> #
>>> point = {'x': np.array(-0.47652306228698005),
... 'y': np.array([[-0.41809043],
... [ 0.48407823]])}
>>> points = 10 * [point]
>>> ## Convert to Xray DataArrays:
>>> #
>>> list_x = [p['x'] for p in points]
>>> list_y = [p['y'] for p in points]
>>> da_x = xray.DataArray(list_x, [('x', range(len(list_x)))])
>>> da_y = xray.DataArray(list_y, [
... ('x', range(len(list_y))),
... ('y0', range(2)),
... ('y1', [0]),
... ])
These are the two DataArray instances we have built so far:
>>> print(da_x)
<xray.DataArray (x: 10)>
array([-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306,
-0.47652306, -0.47652306, -0.47652306, -0.47652306, -0.47652306])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
>>> print(da_y.T) ## Transposed, to save lines.
<xray.DataArray (y1: 1, y0: 2, x: 10)>
array([[[-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043,
-0.41809043, -0.41809043, -0.41809043, -0.41809043, -0.41809043],
[ 0.48407823, 0.48407823, 0.48407823, 0.48407823, 0.48407823,
0.48407823, 0.48407823, 0.48407823, 0.48407823, 0.48407823]]])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y0 (y0) int32 0 1
* y1 (y1) int32 0
We can now merge these two DataArray instances on their common x dimension into a Dataset:
>>> ds = xray.Dataset({'X':da_x, 'Y':da_y})
>>> print(ds)
<xray.Dataset>
Dimensions: (x: 10, y0: 2, y1: 1)
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y0 (y0) int32 0 1
* y1 (y1) int32 0
Data variables:
X (x) float64 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 -0.4765 ...
Y (x, y0, y1) float64 -0.4181 0.4841 -0.4181 0.4841 -0.4181 0.4841 -0.4181 ...
And we can finally access and aggregate data the way you wanted:
>>> ds['X'].sum()
<xray.DataArray 'X' ()>
array(-4.765230622869801)
>>> ds['Y'].sum()
<xray.DataArray 'Y' ()>
array(0.659878)
>>> ds['Y'].sum(axis=1)
<xray.DataArray 'Y' (x: 10, y1: 1)>
array([[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878],
[ 0.0659878]])
Coordinates:
* x (x) int32 0 1 2 3 4 5 6 7 8 9
* y1 (y1) int32 0
>>> np.all(ds['Y'].sum(axis=1) == ds['Y'].sum(dim='y0'))
True
>>> ds['X'].sum(dim='y0')
Traceback (most recent call last):
ValueError: 'y0' not found in array dimensions ('x',)
[1] A library for handling N-dimensional data with labels, like pandas does for 2D: http://xray.readthedocs.org/en/stable/data-structures.html#dataset
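If installing another library is not an option, the stacking step that the DataArray construction performs can be approximated with plain NumPy. This is only a rough sketch and not part of xray: the helper name stack_points is made up for illustration, and you lose the dimension labels, so axes have to be addressed by position.

```python
import numpy as np

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]


def stack_points(points):
    """Hypothetical helper: turn a list of {name: array} dicts into one
    {name: array} dict whose arrays gain a leading sample axis."""
    return {key: np.stack([p[key] for p in points]) for key in points[0]}


trace = stack_points(points)
print(trace['x'].shape)   # one scalar per sample
print(trace['y'].shape)   # sample axis first, then the original (2, 1) shape

# Positional equivalent of ds['Y'].sum(dim='y0') in the session above.
y_sum = trace['y'].sum(axis=1)
print(y_sum.shape)
```

The trade-off is that every key keeps a real numeric dtype and its full shape, but nothing ties the sample axes of different keys together the way a Dataset does.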
Combining @Eike's answer and @JohnSalvatier's comment seems pretty Pandasonic:
>>> import pandas as pd
>>> import numpy as np
>>> point = {'x': np.array(-0.47652306228698005),
... 'y': np.array([[-0.41809043],
... [ 0.48407823]])}
>>> points = 10 * [point] # this creates a list of 10 point dicts
>>> df = pd.DataFrame(points)  # DataFrame.append was removed in pandas 2.0
>>> df.x
# 0 -0.476523062287
# ...
# 9 -0.476523062287
# Name: x, dtype: object
>>> df.y
# 0 [[-0.41809043], [0.48407823]]
# ...
# 9 [[-0.41809043], [0.48407823]]
# Name: y, dtype: object
>>> df.y[0]
# array([[-0.41809043],
# [ 0.48407823]])
>>> df.y[0].shape
# (2, 1)
To plot (and do all the other cool 2-D Pandas things) you still have to manually convert the column of arrays back to a DataFrame:
>>> dfy = pd.DataFrame([row.T[0] for row in df.y])
>>> # Perturb the constant sample values so the plot is not just two flat lines.
>>> dfy += np.array([[0] * 10, range(10)]).T
>>> dfy *= np.array([range(10), range(10)]).T
>>> dfy.plot()
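The manual conversion above is specific to a (2, 1) array. A more general version might look like the sketch below, which flattens a column of equally shaped arrays into one numeric column per element. The function name flatten_column and the 'y[i][j]' column labels are my own invention, not something pandas provides.

```python
import numpy as np
import pandas as pd


def flatten_column(arrays, name):
    """Hypothetical helper: expand a sequence of equally shaped arrays into a
    DataFrame with one float column per array element."""
    stacked = np.stack(list(arrays))              # (n_samples, *element_shape)
    flat = stacked.reshape(stacked.shape[0], -1)  # C order, matches np.ndindex
    labels = [name + ''.join('[%d]' % i for i in index)
              for index in np.ndindex(*stacked.shape[1:])]
    return pd.DataFrame(flat, columns=labels)


point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

dfy = flatten_column([p['y'] for p in points], 'y')
print(dfy.columns.tolist())
print(dfy.dtypes)
```

The result is an ordinary numeric DataFrame, so dfy.plot() and the other 2-D tools should work on it without further conversion.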
To store this on disk, use to_pickle:
>>> df.to_pickle('/tmp/sotest.pickle')
>>> df2 = pd.read_pickle('/tmp/sotest.pickle')
>>> df2.y[0].shape
# (2, 1)
If you use to_csv, your np.arrays become strings:
>>> df.to_csv('/tmp/sotest.csv')
>>> df2 = pd.read_csv('/tmp/sotest.csv', index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
>>> df2.y[0]
# '[[-0.41809043]\n [ 0.48407823]]'
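If you would rather not pickle and CSV is out because of the string conversion, one alternative (swapped in here, not taken from the answers on this page) might be NumPy's own .npz archive. The sketch below assumes you are happy to stack each key into a single array first; the file name trace.npz is arbitrary.

```python
import os
import tempfile

import numpy as np

point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

# One array per key, with the sample index as the leading axis.
stacked = {key: np.stack([p[key] for p in points]) for key in point}

path = os.path.join(tempfile.mkdtemp(), 'trace.npz')
np.savez(path, **stacked)

# Reading back returns real arrays with their original shape and dtype.
with np.load(path) as archive:
    restored = {key: archive[key] for key in archive.files}

print(restored['y'].shape, restored['y'].dtype)
```

This stores the arrays only, so any pandas index or column metadata would have to be saved separately.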
It goes a bit against Pandas' philosophy, which seems to see Series as a one-dimensional data structure. Therefore you have to create the Series by hand and tell them that they have data type "object", which means that no automatic data conversions are applied.
You can do it like this (reordered IPython session):
In [9]: import numpy
...: import pandas as pd
...: from numpy import array
In [1]: point = {'x': array(-0.47652306228698005),
...: 'y': array([[-0.41809043],
...: [ 0.48407823]])}
In [2]: points = 10 * [ point]
In [5]: lx = [p["x"] for p in points]
In [7]: ly = [p["y"] for p in points]
In [40]: sx = pd.Series(lx, dtype=numpy.dtype("object"))
In [38]: sy = pd.Series(ly, dtype=numpy.dtype("object"))
In [43]: df = pd.DataFrame({"x":sx, "y":sy})
In [45]: df['x'][1].shape
Out[45]: ()
In [46]: df['y'][1].shape
Out[46]: (2, 1)
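The same idea can be wrapped in a small function. This is a sketch under my own assumptions: the name object_series is made up, and it fills a preallocated object array element by element rather than passing the list straight to pd.Series, so that nothing gets a chance to merge the equally shaped arrays into one block.

```python
import numpy as np
import pandas as pd


def object_series(values):
    """Hypothetical helper: build an object-dtype Series holding each value
    as its own element, without stacking the values together."""
    out = np.empty(len(values), dtype=object)
    for i, value in enumerate(values):
        out[i] = value
    return pd.Series(out)


point = {'x': np.array(-0.47652306228698005),
         'y': np.array([[-0.41809043],
                        [0.48407823]])}
points = 10 * [point]

df = pd.DataFrame({key: object_series([p[key] for p in points])
                   for key in point})

print(df.dtypes)
print(np.shape(df['x'].iloc[1]))
print(df['y'].iloc[1].shape)
```

As with the session above, the columns are object dtype, so plotting and arithmetic still need a conversion step first.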