Pandas dataframe without copy

Question

How can I avoid taking a copy of the dictionary supplied when creating a Pandas DataFrame?

>>> a = np.arange(10)
>>> b = np.arange(10.0)
>>> df1 = pd.DataFrame(a)
>>> a[0] = 100
>>> df1
     0
0  100
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
>>> d = {'a':a, 'b':b}
>>> df2 = pd.DataFrame(d)
>>> a[1] = 200
>>> d
{'a': array([100, 200,   2,   3,   4,   5,   6,   7,   8,   9]), 'b': array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])}
>>> df2
     a  b
0  100  0
1    1  1
2    2  2
3    3  3
4    4  4
5    5  5
6    6  6
7    7  7
8    8  8
9    9  9

If I create the DataFrame from just a (as with df1), then changes to a are reflected in df1 (and vice versa). With the dictionary, the change to a[1] does not show up in df2.

Is there any way of making this work when supplying a dictionary?

Jeff · Accepted Answer

There is no way to 'share' a dict and have the frame update when the dict changes. The copy argument is not relevant for a dict: the data is always copied, because it is transformed into an ndarray.

However, there is a way to get this type of dynamic behavior in a limited way.

In [9]: arr = np.array(np.random.rand(5,2))

In [10]: df = DataFrame(arr)

In [11]: arr[0,0] = 0

In [12]: df
Out[12]: 
          0         1
0  0.000000  0.192056
1  0.847185  0.609028
2  0.833997  0.422521
3  0.937638  0.711856
4  0.047569  0.033282

Thus a passed ndarray will, at construction time, be a view onto the underlying numpy array. Depending on how you operate on the DataFrame, you could trigger a copy (e.g. if you assign a new column, or change a column's dtype). This will also only work for a frame with a single dtype.
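One way to see whether a given frame is still a view is to ask NumPy directly with np.shares_memory. The sketch below is only an illustration: column_shares_memory is a hypothetical helper name, and the dict case is printed rather than asserted because the constructor defaults have changed between pandas versions.

```python
import numpy as np
import pandas as pd


def column_shares_memory(df, column, arr):
    """Return True if the data behind df[column] overlaps arr's buffer.

    Hypothetical helper for illustration; not part of pandas.
    """
    return bool(np.shares_memory(df[column].to_numpy(), arr))


arr = np.random.rand(5, 2)

viewed = pd.DataFrame(arr, copy=False)   # ask pandas not to copy the ndarray
copied = pd.DataFrame(arr, copy=True)    # force a private copy
converted = viewed.astype("float32")     # a dtype change allocates new storage

a = np.arange(10)
b = np.arange(10.0)
from_dict = pd.DataFrame({"a": a, "b": b})  # dict input, constructor defaults

print("ndarray, copy=False:", column_shares_memory(viewed, 0, arr))
print("ndarray, copy=True: ", column_shares_memory(copied, 0, arr))
print("after astype:       ", column_shares_memory(converted, 0, arr))
print("dict, column 'a':   ", column_shares_memory(from_dict, "a", a))
```

Running the same check after each operation on the frame shows at which point the link to the original array is lost.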

user48956 · Answer

It is possible to initialize a DataFrame without copying the data. To understand how, you need to understand the BlockManager, which is the underlying data structure used by DataFrame. It tries to group data of the same dtype together and hold it in a single block of memory. It does not function as a collection of columns, as the documentation says. If the data is already provided as a single block, for example when you initialize from a matrix:

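        import numpy as np
        import pandas as pd

        a = np.zeros((100, 20))
        a.flags['WRITEABLE'] = False        # make the source array read-only
        df = pd.DataFrame(a, copy=False)    # ask pandas not to copy it
        # assert_read_only is a helper that is not shown in this answer
        assert_read_only(df[df.columns[0]].iloc)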

... then the DataFrame will usually just reference the ndarray.
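Applied to the question, the single-block case suggests allocating one 2-D array up front and treating its columns as the individual arrays. This is a minimal sketch under two assumptions: every column can use the same dtype (so the integer column a becomes float64 here), and nothing later forces pandas to copy. The name block is made up for the example.

```python
import numpy as np
import pandas as pd

# One 2-D buffer holds every column, so all columns share a single dtype.
block = np.zeros((10, 2), dtype="float64")
a = block[:, 0]   # 1-D views into the shared buffer
b = block[:, 1]
a[:] = np.arange(10)
b[:] = np.arange(10.0)

# Build the frame over the existing buffer instead of over a dict.
df = pd.DataFrame(block, columns=["a", "b"], copy=False)

a[1] = 200  # write through the NumPy view after the frame exists

print(np.shares_memory(df["a"].to_numpy(), a))
print(df["a"].iloc[1])
```

Writes should go through the NumPy views rather than through the DataFrame, since operations on the frame may copy the data and break the link.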

However, this won't work if you start with multiple arrays or have heterogeneous types. In that case, you can monkey-patch the BlockManager to force it not to consolidate columns of the same dtype.

Note that if you initialize your DataFrame with anything other than numpy arrays, pandas will copy the data immediately.

Tags:

pandas

Andy Johnson

2 Answers

Jeff

user48956
