
Pandas dataframe without copy

Tags: pandas

How can I avoid taking a copy of the dictionary supplied when creating a Pandas DataFrame?

>>> import numpy as np
>>> import pandas as pd
>>> a = np.arange(10)
>>> b = np.arange(10.0)
>>> df1 = pd.DataFrame(a)
>>> a[0] = 100
>>> df1
     0
0  100
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
>>> d = {'a':a, 'b':b}
>>> df2 = pd.DataFrame(d)
>>> a[1] = 200
>>> d
{'a': array([100, 200,   2,   3,   4,   5,   6,   7,   8,   9]), 'b': array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])}
>>> df2
     a  b
0  100  0
1    1  1
2    2  2
3    3  3
4    4  4
5    5  5
6    6  6
7    7  7
8    8  8
9    9  9

If I create the dataframe from just a (as with df1), then changes to a are reflected in the dataframe (and vice versa).

Is there any way of making this work when supplying a dictionary?

Andy Johnson asked Apr 30 '13 19:04


2 Answers

There is no way to 'share' a dict and have the frame update when the dict changes. The copy argument is not relevant for a dict: the data is always copied, because it is transformed to an ndarray.
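
The default behaviour described above can be checked with a short script. This is only a sketch, assuming a reasonably recent pandas and NumPy; the variable names are illustrative. It builds a frame from a dict with default arguments, mutates one of the source arrays, and looks at whether the frame noticed.

```python
import numpy as np
import pandas as pd

a = np.arange(10)
b = np.arange(10.0)

# Build the frame from a dict of arrays, using the default arguments.
df = pd.DataFrame({"a": a, "b": b})

# Mutate the source array after the frame has been constructed.
a[1] = 200

# The frame holds its own copy, so it still shows the original value
# and its column does not share memory with the source array.
print(df.loc[1, "a"])
print(np.shares_memory(df["a"].to_numpy(), a))
```

The dict itself does see the change, since it still refers to the same array object; only the frame is disconnected.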

However, there is a way to get this type of dynamic behavior in a limited way.

In [9]: arr = np.random.rand(5,2)

In [10]: df = pd.DataFrame(arr)

In [11]: arr[0,0] = 0

In [12]: df
Out[12]: 
          0         1
0  0.000000  0.192056
1  0.847185  0.609028
2  0.833997  0.422521
3  0.937638  0.711856
4  0.047569  0.033282

Thus, at construction time the frame is a view onto the passed ndarray. Depending on how you operate on the DataFrame, you could trigger a copy (e.g. if you assign a new column or change a column's dtype). This also only works for a frame with a single dtype.
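
As a rough illustration of both points, here is a sketch that assumes a recent pandas. It passes copy=False explicitly, because newer releases may copy an ndarray by default when Copy-on-Write is enabled, and it uses a dtype change as the copy-triggering operation. The name df32 is just for illustration.

```python
import numpy as np
import pandas as pd

arr = np.random.rand(5, 2)

# copy=False asks pandas to reuse the array's memory where it can.
df = pd.DataFrame(arr, copy=False)

# A write to the array is visible through the frame.
arr[0, 0] = 0.0
print(df.iloc[0, 0])

# Changing the dtype allocates new storage, so the new frame is
# no longer linked to the original array.
df32 = df.astype("float32")
arr[1, 1] = -1.0
print(df.iloc[1, 1], df32.iloc[1, 1])
print(np.shares_memory(df32.to_numpy(), arr))
```

The original df still tracks arr after the astype call; only the converted frame has its own storage.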

Jeff answered Jan 03 '23 17:01


It is possible to initialize a dataframe without copying the data. To understand how, you need to understand the BlockManager, which is the underlying data structure used by DataFrame. It tries to group data of the same dtype together and hold their memory in a single block; it does not function as a collection of columns, as the documentation says. If the data is already provided as a single block, for example when you initialize from a matrix:

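        import numpy as np
        import pandas as pd

        a = np.zeros((100,20))
        a.flags['WRITEABLE'] = False   # make the source array read-only
        df = pd.DataFrame(a, copy=False)
        # assert_read_only is a helper that is not shown here; it checks that writes fail
        assert_read_only(df[df.columns[0]].iloc)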

... then the DataFrame will usually just reference the ndarray.
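
One way to check this without the helper used above is to look at the array that the frame hands back. The sketch below is not the answerer's helper; it assumes that to_numpy() on a single-dtype frame returns a view of the underlying block, which is the usual behaviour but an implementation detail.

```python
import numpy as np
import pandas as pd

a = np.zeros((100, 20))
a.flags["WRITEABLE"] = False  # make the source array read-only

df = pd.DataFrame(a, copy=False)

# For a single-dtype frame, to_numpy() normally returns a view of the block.
values = df.to_numpy()

# If the frame references the original array, the view shares its
# memory and inherits the read-only flag.
print(np.shares_memory(values, a))
print(values.flags.writeable)
```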

However, this will not work if you're starting with multiple arrays or have heterogeneous types. In that case, you can monkey-patch the BlockManager to force it not to consolidate same-typed data columns.

Note that if you initialize your dataframe with non-numpy arrays, pandas will immediately copy them.
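
A small sketch of that last point, taking a plain Python list as the "non-numpy array" (that choice is an assumption made for illustration): the list has to be converted into pandas' own storage, so later changes to the list do not reach the frame.

```python
import pandas as pd

values = [1, 2, 3]

# The list is converted into an array owned by the frame.
df = pd.DataFrame({"a": values})

# Mutating the list afterwards has no effect on the frame.
values[0] = 100
print(df["a"].tolist())
```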

user48956 answered Jan 03 '23 17:01