The pandas.DataFrame.to_numpy method has a copy argument with the following documentation:
copy : bool, default False
Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.
Playing around a bit, it seems that calling to_numpy on data that is both adjacent in memory and not of mixed types keeps a view. But how do I check whether the resulting NumPy array shares memory with the data frame it was created from, without changing the data?
Example of memory sharing:
import pandas as pd
import numpy as np
# some data frame that I expect not to be copied
frame = pd.DataFrame(np.arange(144).reshape(12,12))
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
# 0 1 2 3 4 5 6 7 8 9 10 11
# 0 0 0 0 0 0 0 0 0 0 0 0 0
# 1 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 0 0 0 0 0 0 0 0 0 0
# 3 0 0 0 0 0 0 0 0 0 0 0 0
# 4 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 0 0 0 0 0 0 0 0 0 0 0
# 6 0 0 0 0 0 0 0 0 0 0 0 0
# 7 0 0 0 0 0 0 0 0 0 0 0 0
# 8 0 0 0 0 0 0 0 0 0 0 0 0
# 9 0 0 0 0 0 0 0 0 0 0 0 0
# 10 0 0 0 0 0 0 0 0 0 0 0 0
# 11 0 0 0 0 0 0 0 0 0 0 0 0
Example not sharing memory:
import pandas as pd
import numpy as np
# some data frame that I expect to be copied
types = [int, str, float]
frame = pd.DataFrame({
i: [types[i%len(types)](value) for value in col]
for i, col in enumerate(np.arange(144).reshape(12,12).T)
})
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
# 0 1 2 3 4 5 6 7 8 9 10 11
# 0 0 12 24.0 36 48 60.0 72 84 96.0 108 120 132.0
# 1 1 13 25.0 37 49 61.0 73 85 97.0 109 121 133.0
# 2 2 14 26.0 38 50 62.0 74 86 98.0 110 122 134.0
# 3 3 15 27.0 39 51 63.0 75 87 99.0 111 123 135.0
# 4 4 16 28.0 40 52 64.0 76 88 100.0 112 124 136.0
# 5 5 17 29.0 41 53 65.0 77 89 101.0 113 125 137.0
# 6 6 18 30.0 42 54 66.0 78 90 102.0 114 126 138.0
# 7 7 19 31.0 43 55 67.0 79 91 103.0 115 127 139.0
# 8 8 20 32.0 44 56 68.0 80 92 104.0 116 128 140.0
# 9 9 21 33.0 45 57 69.0 81 93 105.0 117 129 141.0
# 10 10 22 34.0 46 58 70.0 82 94 106.0 118 130 142.0
# 11 11 23 35.0 47 59 71.0 83 95 107.0 119 131 143.0
There is numpy.shares_memory you can use:
# Your first example
print(np.shares_memory(array, frame)) # True, they are sharing memory
# Your second example
print(np.shares_memory(array2, frame2)) # False, they are not sharing memory
There is also numpy.may_share_memory, which is faster but only checks whether the memory bounds overlap. It can therefore only be used to make sure that things do not share memory, so strictly speaking it does not answer the question. Read this for the differences.
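To illustrate the difference between the two functions, here is a minimal sketch using plain NumPy arrays only (the names base, evens and odds are made up for this example). Two interleaved slices have overlapping address ranges but no element in common:

```python
import numpy as np

base = np.arange(10)
evens = base[::2]   # elements at positions 0, 2, 4, 6, 8
odds = base[1::2]   # elements at positions 1, 3, 5, 7, 9

# The two views interleave, so their memory bounds overlap ...
bounds_overlap = np.may_share_memory(evens, odds)
# ... but no element belongs to both views, which the exact check detects.
really_shared = np.shares_memory(evens, odds)

print(bounds_overlap)  # True
print(really_shared)   # False
```

So a False from may_share_memory is conclusive, while a True still needs the exact check from shares_memory.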
Take care when using these NumPy functions with pandas data structures: np.shares_memory(frame, frame) returns True for the first example but False for the second, probably because the __array__ method of the data frame in the second example creates a copy behind the scenes.
In your first case you make the frame from an array. The source array is used 'as-is' as the data for the frame. That is, the frame just adds its indices and methods to the original array:
In [377]: arr = np.arange(12).reshape(3,4)
In [378]: df = pd.DataFrame(arr)
In [379]: df
Out[379]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
In [380]: arr1 = df.to_numpy()
In [381]: arr1
Out[381]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
I like to compare arrays with the __array_interface__ dictionary. Note that the data entry is identical in both:
In [382]: arr.__array_interface__
Out[382]:
{'data': (53291792, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 4),
'version': 3}
In [383]: arr1.__array_interface__
Out[383]:
{'data': (53291792, False),
'strides': None,
'descr': [('', '<i8')],
'typestr': '<i8',
'shape': (3, 4),
'version': 3}
I could do the mutation test as well.
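The pointer comparison can be wrapped in a small function. This is only a sketch on plain NumPy arrays; data_pointer is a hypothetical helper name, not something NumPy or pandas provides. Note that equal pointers show the two arrays start at the same address, but a view that starts at a later element shares memory while reporting a different pointer:

```python
import numpy as np

def data_pointer(array):
    """Return the address of the array's first element (hypothetical helper)."""
    return array.__array_interface__["data"][0]

original = np.arange(12).reshape(3, 4)
view = original[:]          # basic slicing returns a view
duplicate = original.copy() # an explicit copy gets its own buffer
offset_view = original[1:]  # a view that starts one row later

same_start = data_pointer(view) == data_pointer(original)
copy_differs = data_pointer(duplicate) != data_pointer(original)
offset_differs = data_pointer(offset_view) != data_pointer(original)

print(same_start)      # True
print(copy_differs)    # True
print(offset_differs)  # True, although offset_view still shares memory
```

Because of the offset case, a differing pointer alone does not prove that an array is a copy.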
In the second case you make the frame from a dictionary. I suspect in this case the frame is actually a collection of pd.Series, though I'm not sure how to test that.
In [393]: df1 = pd.DataFrame({'a':np.arange(3), 'b':np.ones(3)})
In [394]: df1
Out[394]:
a b
0 0 1.0
1 1 1.0
2 2 1.0
In [395]: x = df1.to_numpy()
In [396]: x
Out[396]:
array([[0., 1.],
[1., 1.],
[2., 1.]])
The change in dtypes is a good indication that x is a copy. The columns of df1 differ in dtype, while x is all float.
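The dtype observation can be turned into a quick check that never touches the data. A minimal sketch, where has_mixed_dtypes is a hypothetical helper name: if the columns do not all have the same dtype, the single-dtype array returned by to_numpy() cannot be a view of all of them. The reverse does not hold; a uniform dtype makes a view possible but does not guarantee it.

```python
import numpy as np
import pandas as pd

def has_mixed_dtypes(frame):
    """True if the frame's columns do not all share one dtype (hypothetical helper)."""
    return len(set(frame.dtypes)) > 1

mixed = pd.DataFrame({"a": np.arange(3), "b": np.ones(3)})
uniform = pd.DataFrame({"a": np.arange(3.0), "b": np.ones(3)})

converted = mixed.to_numpy()

print(has_mixed_dtypes(mixed))    # True
print(converted.dtype)            # float64, the common dtype of int and float
print(has_mixed_dtypes(uniform))  # False
```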
And with the mutating test:
In [397]: x *= 0
In [398]: df1
Out[398]:
a b
0 0 1.0
1 1 1.0
2 2 1.0
On the other hand, when the same frame is constructed with all floats, the array isn't a copy:
In [399]: df1 = pd.DataFrame({'a':np.arange(3.), 'b':np.ones(3)})
In [400]: df1
Out[400]:
a b
0 0.0 1.0
1 1.0 1.0
2 2.0 1.0
In [401]: x = df1.to_numpy()
In [402]: x *= 0
In [403]: df1
Out[403]:
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
Others have suggested looking at the flags. I'm not sure that's reliable. I checked the [396] case, and x had OWNDATA set to False even though it is a copy.
I probably haven't added much to your observations. I think we need to dig more into how a frame stores its data. That may depend, not only on how the frame was constructed, but also on how it was modified (for example, what happens when I add a column?).
df.to_numpy is just np.array(self.values, dtype=dtype, copy=copy). At this level, whether it's a copy or not depends on the dtype conversion, if any.
df.values is a property that does:
self._consolidate_inplace()
return self._data.as_array(transpose=self._AXIS_REVERSED)
df._data is a BlockManager (at least in my examples). If this is a single_block, its as_array does
np.asarray(mgr.blocks[0].get_values())
I was going to show the BlockManagers for the different dataframes, but just lost that interactive IPython session.
The [379] frame has just one integer block; the [394] frame has two, one float, one integer.
In any case, there's a lot of pandas code behind the to_numpy() method. And much of it depends on exactly how the data is stored for that frame. So I don't think there's a simple surefire way of identifying whether an array is a copy or not. Except in simple, uniform dataframe cases, it's better to assume it's a copy. But be wary of modifying the array if you don't want to modify the frame.
Use df.to_numpy(copy=True) to be sure that you get a copy.
I don't think you can be sure about getting a view. If the df has a uniform, matching dtype, there's a good chance it's a view, especially if the construction wasn't too convoluted.
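As a small sketch of the safe direction (the variable names are made up for this example): with copy=True the array can be modified freely, and comparing the frame against a snapshot confirms that it was left alone.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6).reshape(2, 3))
snapshot = frame.copy()  # deep copy, kept only to compare afterwards

safe = frame.to_numpy(copy=True)  # guaranteed to be a separate buffer
safe[:] = 0                       # modify the array only

unchanged = frame.equals(snapshot)
print(unchanged)  # True
```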
====
In [2]: df = pd.DataFrame(np.ones((3,4),int))
In [3]: df
Out[3]:
0 1 2 3
0 1 1 1 1
1 1 1 1 1
2 1 1 1 1
In [4]: df.to_numpy().flags
Out[4]:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False <====
...
In [5]: df.to_numpy(copy=True).flags
Out[5]:
...
OWNDATA : True
Now a frame with mixed dtypes:
In [7]: df1 = pd.DataFrame({'a':np.arange(3), 'b':np.ones(3)})
In [8]: df1
Out[8]:
a b
0 0 1.0
1 1 1.0
2 2 1.0
This is a copy, but it doesn't own its data. Note that this is F_CONTIGUOUS; I think that means there's a transpose in the generation code, which would account for the False owndata:
In [10]: df1.to_numpy().flags
Out[10]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
...
In [11]: df1.to_numpy()
Out[11]:
array([[0., 1.],
[1., 1.],
[2., 1.]])
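The same combination of flags can be reproduced with plain NumPy. This is a sketch under the assumption, suspected above but not confirmed, that pandas builds a fresh array and then transposes it; it does not use any pandas code. The transposed array is a view of the fresh copy, so OWNDATA is False even though nothing is shared with the source:

```python
import numpy as np

source = np.arange(6.0).reshape(2, 3)

# Copy first, then transpose: the result is a view of the copy, not of source.
transposed_copy = source.copy().T

print(transposed_copy.flags.owndata)               # False
print(transposed_copy.flags.f_contiguous)          # True
print(np.shares_memory(source, transposed_copy))   # False
```

In other words, OWNDATA only says whether an array is a view of some buffer, not which one, so it cannot tell you whether the frame's data is shared.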
The BlockManager has two blocks, one for each dtype:
In [12]: df1._data
Out[12]:
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(1, 2, 1), 1 x 3, dtype: float64
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64
df1.values is:
return self._data.as_array(transpose=self._AXIS_REVERSED)
as_array without and with transpose:
In [14]: df1._data.as_array()
Out[14]:
array([[0., 1., 2.],
[1., 1., 1.]])
In [15]: df1._data.as_array(transpose=True)
Out[15]:
array([[0., 1.],
[1., 1.],
[2., 1.]])
So to_numpy uses np.array(values) with the potential of copy and dtype. values passes the task to the BlockManager, which does at least one np.asarray() and a (probable) transpose. If there is more than one block, it does an _interleave (which I haven't explored).
So while to_numpy(copy=True) ensures a copy, it's harder to predict/detect whether processing up to that point has created a copy or not.
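If a non-mutating check is still wanted, one possible sketch is to compare the array against each column's values with np.shares_memory. The helper name shares_memory_with_frame is hypothetical, and it assumes that a single column's to_numpy() returns a view of the stored data, which may not hold for every dtype or pandas version. A True result is therefore conclusive, while a False result could be a false negative:

```python
import numpy as np
import pandas as pd

def shares_memory_with_frame(array, frame):
    """Hypothetical helper: True if `array` overlaps any column's values.

    Assumes each column's to_numpy() is a view of the stored data; if pandas
    copies a column instead, the overlap goes undetected.
    """
    for position in range(frame.shape[1]):
        column_values = frame.iloc[:, position].to_numpy()
        if np.shares_memory(array, column_values):
            return True
    return False

uniform = pd.DataFrame(np.arange(12).reshape(3, 4))
mixed = pd.DataFrame({"a": np.arange(3), "b": np.ones(3)})

# A forced copy and a mixed-dtype conversion are both separate buffers.
forced_copy_shared = shares_memory_with_frame(uniform.to_numpy(copy=True), uniform)
mixed_shared = shares_memory_with_frame(mixed.to_numpy(), mixed)
print(forced_copy_shared)  # False
print(mixed_shared)        # False

# The default conversion of a single-dtype frame is the case in question;
# the answer depends on how pandas stored and converted the data.
print(shares_memory_with_frame(uniform.to_numpy(), uniform))
```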