
How to find out `DataFrame.to_numpy` did not create a copy

The pandas.DataFrame.to_numpy method has a copy argument with the following documentation:

copy : bool, default False

Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.

Playing around a bit, it seems that calling to_numpy on data that is contiguous in memory and of a single dtype returns a view. But how do I check, without changing any data, whether the resulting NumPy array shares memory with the data frame it was created from?

Example of memory sharing:

import pandas as pd
import numpy as np

# some data frame that I expect not to be copied
frame = pd.DataFrame(np.arange(144).reshape(12,12))
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
#     0  1  2  3  4  5  6  7  8  9  10  11
# 0   0  0  0  0  0  0  0  0  0  0   0   0
# 1   0  0  0  0  0  0  0  0  0  0   0   0
# 2   0  0  0  0  0  0  0  0  0  0   0   0
# 3   0  0  0  0  0  0  0  0  0  0   0   0
# 4   0  0  0  0  0  0  0  0  0  0   0   0
# 5   0  0  0  0  0  0  0  0  0  0   0   0
# 6   0  0  0  0  0  0  0  0  0  0   0   0
# 7   0  0  0  0  0  0  0  0  0  0   0   0
# 8   0  0  0  0  0  0  0  0  0  0   0   0
# 9   0  0  0  0  0  0  0  0  0  0   0   0
# 10  0  0  0  0  0  0  0  0  0  0   0   0
# 11  0  0  0  0  0  0  0  0  0  0   0   0

Example not sharing memory:

import pandas as pd
import numpy as np

# some data frame that I expect to be copied
types = [int, str, float]
frame = pd.DataFrame({
    i: [types[i%len(types)](value) for value in col]
    for i, col in enumerate(np.arange(144).reshape(12,12).T)
})
array = frame.to_numpy()
array[:] = 0
print(frame)
# Prints:
#     0   1     2   3   4     5   6   7      8    9    10     11
# 0    0  12  24.0  36  48  60.0  72  84   96.0  108  120  132.0
# 1    1  13  25.0  37  49  61.0  73  85   97.0  109  121  133.0
# 2    2  14  26.0  38  50  62.0  74  86   98.0  110  122  134.0
# 3    3  15  27.0  39  51  63.0  75  87   99.0  111  123  135.0
# 4    4  16  28.0  40  52  64.0  76  88  100.0  112  124  136.0
# 5    5  17  29.0  41  53  65.0  77  89  101.0  113  125  137.0
# 6    6  18  30.0  42  54  66.0  78  90  102.0  114  126  138.0
# 7    7  19  31.0  43  55  67.0  79  91  103.0  115  127  139.0
# 8    8  20  32.0  44  56  68.0  80  92  104.0  116  128  140.0
# 9    9  21  33.0  45  57  69.0  81  93  105.0  117  129  141.0
# 10  10  22  34.0  46  58  70.0  82  94  106.0  118  130  142.0
# 11  11  23  35.0  47  59  71.0  83  95  107.0  119  131  143.0
Martin asked Jun 10 '20 12:06



2 Answers

There is numpy.shares_memory you can use:

# Your first example
print(np.shares_memory(array, frame))  # True, they are sharing memory

# Your second example (with its array and frame renamed to array2 and frame2)
print(np.shares_memory(array2, frame2))  # False, they are not sharing memory

There is also numpy.may_share_memory, which is faster but only checks whether the memory bounds of the two arrays overlap. A False result guarantees that they do not share memory, while a True result is inconclusive, so strictly speaking it does not answer the question. Read this for the differences.

Take care using these numpy functions with pandas data-structures: np.shares_memory(frame, frame) returns True for the first example, but False for the second, probably because the __array__ method of the data frame in the second example creates a copy behind the scenes.
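
One way to sidestep this pitfall is to pass only NumPy arrays to np.shares_memory. The following is a minimal sketch, not an official pandas facility: it assumes that a frame which hands out views will hand out overlapping views on successive calls, and the helper name to_numpy_returns_view is made up for illustration. The result may differ between pandas versions.

```python
import numpy as np
import pandas as pd


def to_numpy_returns_view(frame: pd.DataFrame) -> bool:
    """Heuristic: do two successive to_numpy() calls overlap in memory?

    If to_numpy() builds a fresh array on every call, the two results
    cannot overlap. If it returns a view on the frame's data, they do.
    Nothing is mutated. (Hypothetical helper, not part of pandas.)
    """
    first = frame.to_numpy()
    second = frame.to_numpy()
    return bool(np.shares_memory(first, second))


# Single dtype, built from one 2-D array
uniform = pd.DataFrame(np.arange(12).reshape(3, 4))
# Mixed int/float columns, which have to be interleaved into a new array
mixed = pd.DataFrame({"a": np.arange(3), "b": np.ones(3)})

print(to_numpy_returns_view(uniform))
print(to_numpy_returns_view(mixed))
```

For the mixed frame the check should report False, because every call has to build a new float array. For the uniform frame the answer depends on how pandas stores and exposes the data, which is exactly what the check is meant to find out.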

ywbaek answered Oct 25 '22 14:10


In your first case you make the frame from an array. The source array is used 'as-is' as the data for the frame. That is, the frame just adds its indices and methods to the original array:

In [377]: arr = np.arange(12).reshape(3,4)                                                    
In [378]: df = pd.DataFrame(arr)                                                              
In [379]: df                                                                                  
Out[379]: 
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
In [380]: arr1 = df.to_numpy()                                                                
In [381]: arr1                                                                                
Out[381]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

I like to compare arrays with the __array_interface__ dictionary. Note that the 'data' entry (the address of the buffer) is identical in both:

In [382]: arr.__array_interface__                                                             
Out[382]: 
{'data': (53291792, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (3, 4),
 'version': 3}
In [383]: arr1.__array_interface__                                                            
Out[383]: 
{'data': (53291792, False),
 'strides': None,
 'descr': [('', '<i8')],
 'typestr': '<i8',
 'shape': (3, 4),
 'version': 3}
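
The pointer comparison can be wrapped in a small function. This is only a sketch using plain NumPy arrays; data_pointer is an illustrative name, not a NumPy function. It also shows the limitation of the approach: equal pointers mean the arrays start at the same address, but a view that starts further into the buffer has a different pointer even though it shares memory.

```python
import numpy as np


def data_pointer(arr: np.ndarray) -> int:
    """Address of the first element, as reported by __array_interface__."""
    return arr.__array_interface__["data"][0]


arr = np.arange(12).reshape(3, 4)
view = arr[:]      # basic slicing returns a view on the same buffer
dup = arr.copy()   # an independent buffer
tail = arr[1:]     # also a view, but starting one row further in

print(data_pointer(arr) == data_pointer(view))  # True
print(data_pointer(arr) == data_pointer(dup))   # False
print(data_pointer(arr) == data_pointer(tail))  # False, although memory is shared
print(np.shares_memory(arr, tail))              # True
```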

I could do the mutation test as well.

In the second case you make the frame from a dictionary. I suspect in this case the frame is actually a collection of pd.Series, though I'm not sure how to test that.

In [393]: df1 = pd.DataFrame({'a':np.arange(3), 'b':np.ones(3)})                              
In [394]: df1                                                                                 
Out[394]: 
   a    b
0  0  1.0
1  1  1.0
2  2  1.0
In [395]: x = df1.to_numpy()                                                                  
In [396]: x                                                                                   
Out[396]: 
array([[0., 1.],
       [1., 1.],
       [2., 1.]])

The change in dtypes is a good indication that x is a copy. The columns of df1 differ in dtype, while x is all float.

And with the mutating test:

In [397]: x *= 0                                                                              
In [398]: df1                                                                                 
Out[398]: 
   a    b
0  0  1.0
1  1  1.0
2  2  1.0

On the other hand, when the same frame is constructed with all floats, the array isn't a copy:

In [399]: df1 = pd.DataFrame({'a':np.arange(3.), 'b':np.ones(3)})                             
In [400]: df1                                                                                 
Out[400]: 
     a    b
0  0.0  1.0
1  1.0  1.0
2  2.0  1.0
In [401]: x = df1.to_numpy()                                                                  
In [402]: x *= 0                                                                              
In [403]: df1                                                                                 
Out[403]: 
     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  0.0

Others have suggested looking at the flags. I'm not sure that's reliable. I checked the [396] case, and x did not have OWNDATA set, even though it is a copy.

I probably haven't added much to your observations. I think we need to dig more into how a frame stores its data. That may depend, not only on how the frame was constructed, but also on how it was modified (for example, what happens when I add a column?).

df.to_numpy is just np.array(self.values, dtype=dtype, copy=copy). At this level, whether it's a copy or not depends on the dtype conversion, if any.

df.values is a property that does:

self._consolidate_inplace()
return self._data.as_array(transpose=self._AXIS_REVERSED)

df._data is a BlockManager (at least in my examples)

If this is a single_block, its as_array does

np.asarray(mgr.blocks[0].get_values())

I was going to show the BlockManagers for the different dataframes, but just lost that interactive IPython session.

The [379] frame has just one integer block; the [394] frame has two, one float, one integer.

In any case, there's a lot of pandas code behind the to_numpy() method. And much of it depends on exactly how the data is stored for that frame. So I don't think there's a simple surefire way of identifying whether an array is a copy or not. Except in simple, uniform dataframe cases, it's better to assume it's a copy. But be wary of modifying the array if you don't want to modify the frame.

Use df.to_numpy(copy=True) to be sure that you get a copy.
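
A short sketch of that guarantee, using the all-integer frame from this answer (the variable name safe is only illustrative): the forced copy can be mutated freely without touching the frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones((3, 4), dtype=int))

safe = df.to_numpy(copy=True)  # always an independent array
safe[:] = 0                    # mutate the copy only

print(df.to_numpy().sum())                    # 12, the frame is unchanged
print(np.shares_memory(safe, df.to_numpy()))  # False
```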

I don't think you can be sure about getting a view. If the df has a uniform, matching dtype, there's a good chance it's a view, especially if the construction wasn't too convoluted.
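
The "uniform dtype" condition can at least be checked through the public API. The helper below is a hypothetical sketch (has_uniform_dtype is not a pandas function). A True result only means a view is possible, not that one is returned; a False result means to_numpy has to convert to a common dtype and therefore has to copy.

```python
import numpy as np
import pandas as pd


def has_uniform_dtype(frame: pd.DataFrame) -> bool:
    """True if every column has the same dtype (hypothetical helper)."""
    return bool(frame.dtypes.nunique() == 1)


uniform = pd.DataFrame({"a": np.arange(3.0), "b": np.ones(3)})
mixed = pd.DataFrame({"a": np.arange(3), "b": np.ones(3)})

print(has_uniform_dtype(uniform))  # True
print(has_uniform_dtype(mixed))    # False
print(mixed.to_numpy().dtype)      # float64, the common dtype after conversion
```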

====

In [2]: df = pd.DataFrame(np.ones((3,4),int))                                                                   
In [3]: df                                                                                                      
Out[3]: 
   0  1  2  3
0  1  1  1  1
1  1  1  1  1
2  1  1  1  1
In [4]: df.to_numpy().flags                                                                                     
Out[4]: 
  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False              <====
  ...
In [5]: df.to_numpy(copy=True).flags                                                                            
Out[5]: 
  ...
  OWNDATA : True

Now a frame with mixed dtypes:

In [7]: df1 = pd.DataFrame({'a':np.arange(3), 'b':np.ones(3)})                                                  
In [8]: df1                                                                                                     
Out[8]: 
   a    b
0  0  1.0
1  1  1.0
2  2  1.0

This is a copy, but doesn't owndata. Note that this is F_CONTIGUOUS; I think that means there's a transpose in the generation code, which would account for the False owndata:

In [10]: df1.to_numpy().flags                                                                                   
Out[10]: 
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  ...
In [11]: df1.to_numpy()                                                                                         
Out[11]: 
array([[0., 1.],
       [1., 1.],
       [2., 1.]])
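
The same OWNDATA behaviour can be reproduced with NumPy alone, which suggests why the flag is a poor copy detector. In this sketch (variable names are illustrative) the transpose of a fresh copy does not own its data, because it is a view on that temporary copy, yet it shares nothing with the original array:

```python
import numpy as np

base = np.zeros((3, 2))

# Transpose of a fresh copy: a view on the copy, independent of base
independent = base.copy().T

print(independent.flags.owndata)            # False
print(independent.flags.f_contiguous)       # True
print(np.shares_memory(independent, base))  # False

independent[:] = 1.0
print(base.sum())  # 0.0, base is untouched
```

So OWNDATA being False only says the array is a view on some buffer, not that the buffer belongs to the frame.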

The BlockManager has two blocks, one for each dtype:

In [12]: df1._data                                                                                              
Out[12]: 
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
FloatBlock: slice(1, 2, 1), 1 x 3, dtype: float64
IntBlock: slice(0, 1, 1), 1 x 3, dtype: int64

df1.values is:

return self._data.as_array(transpose=self._AXIS_REVERSED)

as_array without transpose and with:

In [14]: df1._data.as_array()                                                                                   
Out[14]: 
array([[0., 1., 2.],
       [1., 1., 1.]])
In [15]: df1._data.as_array(transpose=True)                                                                     
Out[15]: 
array([[0., 1.],
       [1., 1.],
       [2., 1.]])

So to_numpy uses np.array(values), with a potential copy and dtype conversion. values passes the task to the BlockManager, which does at least one np.asarray() and a (probable) transpose. If there is more than one block, it does an _interleave (which I haven't explored).

So while to_numpy(copy=True) ensures a copy, it's harder to predict/detect whether processing up to that point has created a copy or not.

hpaulj answered Oct 25 '22 14:10