How to find out `DataFrame.to_numpy` did not create a copy

Tags: python, pandas, numpy

The pandas.DataFrame.to_numpy method has a copy argument with the following documentation:

copy : bool, default False

Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.

Playing around a bit, it seems that calling to_numpy on data that is both contiguous in memory and of a single dtype returns a view. But how do I check whether the resulting NumPy array shares memory with the data frame it was created from, without changing the data?

Example of memory sharing:

import pandas as pd
import numpy as np

# some data frame that I expect not to be copied
frame = pd.DataFrame(np.arange(144).reshape(12,12))
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
#     0  1  2  3  4  5  6  7  8  9  10  11
# 0   0  0  0  0  0  0  0  0  0  0   0   0
# 1   0  0  0  0  0  0  0  0  0  0   0   0
# 2   0  0  0  0  0  0  0  0  0  0   0   0
# 3   0  0  0  0  0  0  0  0  0  0   0   0
# 4   0  0  0  0  0  0  0  0  0  0   0   0
# 5   0  0  0  0  0  0  0  0  0  0   0   0
# 6   0  0  0  0  0  0  0  0  0  0   0   0
# 7   0  0  0  0  0  0  0  0  0  0   0   0
# 8   0  0  0  0  0  0  0  0  0  0   0   0
# 9   0  0  0  0  0  0  0  0  0  0   0   0
# 10  0  0  0  0  0  0  0  0  0  0   0   0
# 11  0  0  0  0  0  0  0  0  0  0   0   0

Example not sharing memory:

import pandas as pd
import numpy as np

# some data frame that I expect to be copied
types = [int, str, float]
frame = pd.DataFrame({
    i: [types[i%len(types)](value) for value in col]
    for i, col in enumerate(np.arange(144).reshape(12,12).T)
})
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
#     0   1     2   3   4     5   6   7      8    9    10     11
# 0    0  12  24.0  36  48  60.0  72  84   96.0  108  120  132.0
# 1    1  13  25.0  37  49  61.0  73  85   97.0  109  121  133.0
# 2    2  14  26.0  38  50  62.0  74  86   98.0  110  122  134.0
# 3    3  15  27.0  39  51  63.0  75  87   99.0  111  123  135.0
# 4    4  16  28.0  40  52  64.0  76  88  100.0  112  124  136.0
# 5    5  17  29.0  41  53  65.0  77  89  101.0  113  125  137.0
# 6    6  18  30.0  42  54  66.0  78  90  102.0  114  126  138.0
# 7    7  19  31.0  43  55  67.0  79  91  103.0  115  127  139.0
# 8    8  20  32.0  44  56  68.0  80  92  104.0  116  128  140.0
# 9    9  21  33.0  45  57  69.0  81  93  105.0  117  129  141.0
# 10  10  22  34.0  46  58  70.0  82  94  106.0  118  130  142.0
# 11  11  23  35.0  47  59  71.0  83  95  107.0  119  131  143.0

asked Jun 10 '20 12:06

Martin

2 Answers

You can use numpy.shares_memory:

# Your first example
print(np.shares_memory(array, frame))  # True, they are sharing memory

# Your second example (its variables renamed to array2 and frame2 here)
print(np.shares_memory(array2, frame2))  # False, they are not sharing memory

There is also numpy.may_share_memory, which is faster but only checks whether the memory bounds overlap. It can therefore report True for arrays that do not actually share memory, so it is only reliable for confirming that two arrays do not share memory and, strictly speaking, does not answer the question. Read this for the differences.

Take care when using these NumPy functions with pandas data structures: np.shares_memory(frame, frame) returns True for the first example but False for the second, probably because the __array__ method of the data frame in the second example creates a copy behind the scenes.
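
One way to sidestep the implicit __array__ conversion is to compare plain NumPy arrays only. The following is a minimal sketch, not part of the original answer: shares_memory_with_frame is a hypothetical helper, and it assumes that frame.iloc[:, i].to_numpy() returns a view for plain NumPy-backed columns, which may not hold for every dtype or pandas version.

```python
import numpy as np
import pandas as pd


def shares_memory_with_frame(array, frame):
    """Hypothetical helper: True if `array` overlaps any column's buffer.

    Assumption: frame.iloc[:, i].to_numpy() is a view for NumPy-backed columns.
    """
    return any(
        np.shares_memory(array, frame.iloc[:, i].to_numpy())
        for i in range(frame.shape[1])
    )


uniform = pd.DataFrame(np.arange(12).reshape(3, 4))
mixed = pd.DataFrame({"a": np.arange(3), "b": np.ones(3)})

# Compare the converted array against the frame's own column buffers.
print(shares_memory_with_frame(uniform.to_numpy(), uniform))
print(shares_memory_with_frame(mixed.to_numpy(), mixed))
print(shares_memory_with_frame(uniform.to_numpy(copy=True), uniform))
```

Because only ndarrays are passed to np.shares_memory, no hidden copy is made during the comparison itself.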

answered Oct 25 '22 14:10

ywbaek

In your first case you make the frame from an array. The source array is used 'as-is' as the data for the frame. That is, the frame just adds its indices and methods to the original array:

In [377]: arr = np.arange(12).reshape(3,4)                                                    
In [378]: df = pd.DataFrame(arr)                                                              
In [379]: df                                                                                  
Out[379]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
In [380]: arr1 = df.to_numpy()                                                                
In [381]: arr1                                                                                
Out[381]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

I like to compare arrays with the __array_interface__ dictionary. Note that the 'data' entry (the buffer address) is identical in both:

In [382]: arr.__array_interface__                                                             
Out[382]: 
{'data': (53291792, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (3, 4),
 'version': 3}
In [383]: arr1.__array_interface__                                                            
Out[383]: 
{'data': (53291792, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (3, 4),
 'version': 3}

I could do the mutation test as well.
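
As a rough sketch of the pointer comparison above (the helper name data_pointer is made up for illustration), the buffer address can be pulled out of __array_interface__ and compared directly. Two default to_numpy() calls on a uniform frame are assumed to return views on the same block, while copy=True allocates a new buffer.

```python
import numpy as np
import pandas as pd


def data_pointer(a):
    """Address of the first element of a NumPy array's buffer."""
    return a.__array_interface__["data"][0]


df = pd.DataFrame(np.arange(12).reshape(3, 4))
view_a = df.to_numpy()
view_b = df.to_numpy()
copied = df.to_numpy(copy=True)

# Views on the same block start at the same address; a copy lives elsewhere.
same_start = data_pointer(view_a) == data_pointer(view_b)
copy_differs = data_pointer(copied) != data_pointer(view_a)
print(same_start, copy_differs)
```

Equal pointers only show that two arrays start at the same address. A view into the middle of a buffer has a different pointer and still shares memory, so np.shares_memory is the more general test.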

In the second case you make the frame from a dictionary. I suspect in this case the frame is actually a collection of pd.Series, though I'm not sure how to test that.

In [393]: df1 = pd.DataFrame({'a':np.arange(3), 'b':np.ones(3)})                              
In [394]: df1                                                                                 
Out[394]: 
   a    b
0  0  1.0
1  1  1.0
2  2  1.0
In [395]: x = df1.to_numpy()                                                                  
In [396]: x                                                                                   
Out[396]: 
array([[0., 1.],
       [1., 1.],
       [2., 1.]])

The change in dtypes is a good indication that x is a copy. Columns of df1 differ in dtype, while x is all float.
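
The dtype observation can be turned into a quick pre-check. This is only a sketch under the assumption that columns of different dtypes can never be exposed as one ndarray without building a new one; conversion_must_copy is a hypothetical name, and a False result does not prove that a view is returned.

```python
import numpy as np
import pandas as pd


def conversion_must_copy(frame):
    """Hypothetical check: several column dtypes cannot fit one ndarray as a view."""
    return frame.dtypes.nunique() > 1


mixed = pd.DataFrame({"a": np.arange(3), "b": np.ones(3)})
floats = pd.DataFrame({"a": np.arange(3.0), "b": np.ones(3)})

# The mixed frame is upcast to a common dtype, which requires a new array.
print(mixed.dtypes.tolist(), mixed.to_numpy().dtype)
print(conversion_must_copy(mixed))
print(conversion_must_copy(floats))
```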

And with the mutating test:

In [397]: x *= 0                                                                              
In [398]: df1                                                                                 
Out[398]: 
   a    b
0  0  1.0
1  1  1.0
2  2  1.0

On the other hand, when the same frame is constructed with all floats, the array isn't a copy:

In [399]: df1 = pd.DataFrame({'a':np.arange(3.), 'b':np.ones(3)})                             
In [400]: df1                                                                                 
Out[400]: 
     a    b
0  0.0  1.0
1  1.0  1.0
2  2.0  1.0
In [401]: x = df1.to_numpy()                                                                  
In [402]: x *= 0                                                                              
In [403]: df1                                                                                 
Out[403]: 
     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0

Others have suggested looking at the flags. I'm not sure that's reliable. I checked the [396] case, and x did not have OWNDATA set, even though it is a copy.

I probably haven't added much to your observations. I think we need to dig more into how a frame stores its data. That may depend, not only on how the frame was constructed, but also on how it was modified (for example, what happens when I add a column?).

df.to_numpy is just np.array(self.values, dtype=dtype, copy=copy). At this level, whether it's a copy or not depends on the dtype conversion, if any.

df.values is a property that does:

self._consolidate_inplace()
return self._data.as_array(transpose=self._AXIS_REVERSED)

df._data is a BlockManager (at least in my examples)

If this is a single_block, its as_array does

np.asarray(mgr.blocks[0].get_values())

I was going to show the BlockManagers for the different dataframes, but just lost that interactive IPython session.

The [379] frame has just one integer block; the [394] frame has two, one float, one integer.

In any case, there's a lot of pandas code behind the to_numpy() method. And much of it depends on exactly how the data is stored for that frame. So I don't think there's a simple surefire way of identifying whether an array is a copy or not. Except in simple, uniform dataframe cases, it's better to assume it's a copy. But be wary of modifying the array if you don't want to modify the frame.

Use df.to_numpy(copy=True) to be sure that you get a copy.

I don't think you can be sure about getting a view. If the df has a uniform, matching dtype, there's a good chance it's a view, especially if the construction wasn't too convoluted.
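
A small sketch of the safe direction, assuming nothing beyond the documented copy=True behaviour: mutate the explicit copy and confirm the frame is untouched. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(3.0), "b": np.ones(3)})
before = df.copy()

scratch = df.to_numpy(copy=True)  # always a fresh, independent array
scratch *= 0                      # mutate the copy only

frame_unchanged = df.equals(before)
print(frame_unchanged)  # True
```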

====

In [2]: df = pd.DataFrame(np.ones((3,4),int))                                                                   
In [3]: df                                                                                                      
Out[3]: 
   0  1  2  3
0  1  1  1  1
1  1  1  1  1
2  1  1  1  1
In [4]: df.to_numpy().flags                                                                                     
Out[4]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False              <====
  ...
In [5]: df.to_numpy(copy=True).flags                                                                            
Out[5]: 
  ...
  OWNDATA : True

Now a frame with mixed dtypes:

In [7]: df1 = pd.DataFrame({'a':np.arange(3), 'b':np.ones(3)})                                                  
In [8]: df1                                                                                                     
Out[8]: 
   a    b
0  0  1.0
1  1  1.0
2  2  1.0

This is a copy, but it doesn't own its data (OWNDATA is False). Note that it is F_CONTIGUOUS; I think that means there's a transpose in the generation code, which would account for the False OWNDATA:

In [10]: df1.to_numpy().flags                                                                                   
Out[10]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  ...
In [11]: df1.to_numpy()                                                                                         
Out[11]: 
array([[0., 1.],
       [1., 1.],
       [2., 1.]])
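
The transpose explanation can be reproduced with NumPy alone. This sketch does not touch pandas internals; it only mimics the suspected pattern of building a fresh array and then transposing it, to show why OWNDATA is not a reliable copy detector.

```python
import numpy as np

source = np.arange(6.0).reshape(2, 3)
fresh = source.copy()   # new buffer, independent of `source`
transposed = fresh.T    # a view on `fresh`, as a transpose step would produce

print(transposed.flags.owndata)              # False: it borrows the buffer of `fresh`
print(transposed.flags.f_contiguous)         # True
print(np.shares_memory(transposed, source))  # False: still a copy of `source`
```

OWNDATA only says whether an array owns its buffer, not whose buffer it borrows, so a False value is consistent with both a view on the frame and a view on a temporary copy.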

The BlockManager has two blocks, one for each dtype:

In [12]: df1._data                                                                                              
Out[12]: 
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(1, 2, 1), 1 x 3, dtype: float64
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64
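
More recent pandas versions expose the manager as the private attribute _mgr. The sketch below counts blocks through it, as a rough heuristic only: _mgr and its blocks attribute are internal, undocumented, and may change between releases, and count_blocks is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def count_blocks(frame):
    """Hypothetical helper relying on private pandas internals (may break)."""
    return len(frame._mgr.blocks)


uniform = pd.DataFrame(np.ones((3, 4), int))
mixed = pd.DataFrame({"a": np.arange(3), "b": np.ones(3)})

# One block per dtype: a single block is a precondition for a no-copy result.
print(count_blocks(uniform))
print(count_blocks(mixed))
```

More than one block means to_numpy() has to interleave the blocks into a new array, so the result cannot be a view.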

df1.values is:

return self._data.as_array(transpose=self._AXIS_REVERSED)

as_array without and with transpose:

In [14]: df1._data.as_array()                                                                                   
Out[14]: 
array([[0., 1., 2.],
       [1., 1., 1.]])
In [15]: df1._data.as_array(transpose=True)                                                                     
Out[15]: 
array([[0., 1.],
       [1., 1.],
       [2., 1.]])

So to_numpy uses np.array(values) with the potential of copy and dtype. values passes the task to the BlockManager, which does at least one np.asarray() and a (probable) transpose. If there is more than one block, it does an _interleave (which I haven't explored).

So while to_numpy(copy=True) ensures a copy, it's harder to predict/detect whether processing up to that point has created a copy or not.

answered Oct 25 '22 14:10

hpaulj
