
Will changes in DataFrame.values always modify the values in the data frame?

In the documentation, it says:

Numpy representation of NDFrame -- Source

What does "Numpy representation of NDFrame" mean? Will modifying this numpy representation affect my original dataframe? In other words, will .values return a copy or a view?

Some Stack Overflow answers implicitly rely on a view being returned. For example, the accepted answer to "Set values on the diagonal of pandas.DataFrame" uses np.fill_diagonal(df.values, 0) to set all values on the diagonal of df to 0, which only works if a view is returned. However, as shown in @coldspeed's answer, sometimes a copy is returned.

This feels very basic. It is just a bit weird to me because I cannot find a more detailed source on .values.


Here is another experiment that returns a view, in addition to those in @coldspeed's answer:

import pandas as pd
df = pd.DataFrame([["A", "B"], ["C", "D"]])

df.values[0][0] = 0

We get

df
    0   1
0   0   B
1   C   D

Even though it is mixed type now, we can still modify the original df by assigning into df.values:

df.values[0][1] = 5
df
    0   1
0   0   5
1   C   D
Tai asked Jan 11 '18 07:01



1 Answer

TL;DR:

Whether a copy is returned (so changing the values does not change the DataFrame) or a view (so changing the values does change the DataFrame) is an implementation detail. Don't rely on either behavior. It could change whenever the pandas developers think that would be beneficial (for example, if they changed the internal structure of DataFrame).
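If the goal is to change the DataFrame itself, one option that does not depend on the view/copy behavior of .values is to assign through a pandas indexer. A minimal sketch, assuming a small all-float frame (the frame and the loop are illustrative, not taken from the linked answer):

```python
import numpy as np
import pandas as pd

# Illustrative 3x3 frame of ones; any numeric frame would do.
df = pd.DataFrame(np.ones((3, 3)))

# Zero the diagonal through .iloc, which writes to the DataFrame itself,
# regardless of whether .values would give a view or a copy.
for i in range(min(df.shape)):
    df.iloc[i, i] = 0.0

print(df)
```

This is slower than np.fill_diagonal on a large frame, but it does not depend on how the values are exported to NumPy.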


I guess the documentation has changed since the question was asked. Currently it reads:

pandas.DataFrame.values

Return a Numpy representation of the DataFrame.

Only the values in the DataFrame will be returned, the axes labels will be removed.

It no longer mentions NDFrame, only a "NumPy representation of the DataFrame". A NumPy representation could be either a view or a copy!

The documentation also contains a Note about mixed dtypes:

Notes

The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.

e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.

From these notes it follows that accessing the values of a DataFrame that contains different dtypes can (almost) never return a view, simply because the values have to be put into a single array of the "lower-common-denominator" dtype, and that involves a copy.

However, the documentation doesn't say anything about the view/copy behavior, and that's by design. jreback mentioned on the pandas issue tracker [1] that this really is just an implementation detail:
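The upcasting described in the notes can be observed directly. A minimal sketch, with small illustrative frames (the column names and values are made up for the example):

```python
import numpy as np
import pandas as pd

# float16 and float32 columns
floats = pd.DataFrame({
    "a": np.array([1.0, 2.0], dtype=np.float16),
    "b": np.array([3.0, 4.0], dtype=np.float32),
})

# int32 and uint8 columns
ints = pd.DataFrame({
    "a": np.array([1, 2], dtype=np.int32),
    "b": np.array([3, 4], dtype=np.uint8),
})

# integers next to strings
mixed = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

print(floats.values.dtype)  # float32
print(ints.values.dtype)    # int32
print(mixed.values.dtype)   # object
```

In each case the result is a single array with one dtype that can hold every column, so it has to be built as a new array.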

this is an implementation detail. since you are getting a single dtyped numpy array, it is upcast to a compatible dtype. if you have mixed dtypes, then you almost always will have a copy (the exception is mixed float dtypes will not copy I think), but this is a numpy detail.

I agree this is not great, but it has been there from the beginning and will not change in current pandas. If exporting to numpy you need to take care.

Even the documentation of Series mentions nothing about a view:

pandas.Series.values

Return Series as ndarray or ndarray-like depending on the dtype

It even says that the result might not be a plain array, depending on the dtype. That certainly includes the possibility (even if only hypothetical) that it returns a copy. It does not guarantee that you get a view.
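A categorical Series is one case where .values is "ndarray-like" rather than an ndarray. A small sketch with illustrative data:

```python
import numpy as np
import pandas as pd

plain = pd.Series([1, 2, 3])
cat = pd.Series(["a", "b", "a"], dtype="category")

print(type(plain.values))  # a NumPy ndarray
print(type(cat.values))    # a pandas Categorical, not an ndarray
```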


When does .values return a view and when does it return a copy?

The answer is simply: it's an implementation detail, and as long as it's an implementation detail there won't be any guarantees. It is an implementation detail because the pandas developers want to be able to change the internal storage if they need to. However, in some cases it's impossible to create a view, for example with a DataFrame containing columns of different dtypes.

Analyzing the current behavior may be useful, but as long as it's an implementation detail you shouldn't rely on it.

However, if you're interested: pandas currently stores columns with the same dtype internally as one multi-dimensional array. That has the advantage that you can operate on rows and columns very efficiently (at least as long as they have the same dtype). But if the DataFrame contains mixed types, it will have several internal multi-dimensional arrays, one for each dtype. A NumPy view cannot point into two distinct arrays, so with mixed dtypes you get a copy when you ask for the values.
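One way to inspect this without writing to anything is np.shares_memory. The sketch below compares two successive .values results for a single-dtype frame and for a mixed-dtype frame. The frames are illustrative, and the single-dtype result reflects whatever pandas version is installed, so treat it as an observation rather than a guarantee:

```python
import numpy as np
import pandas as pd

uniform = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]])       # one float64 dtype
mixed = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})   # int64 and float64

# Two results that are views of the same storage share memory;
# two freshly built copies do not.
uniform_shares = np.shares_memory(uniform.values, uniform.values)
mixed_shares = np.shares_memory(mixed.values, mixed.values)

print("uniform dtype shares memory:", uniform_shares)
print("mixed dtypes share memory:  ", mixed_shares)
```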


A side note on your example:

df = pd.DataFrame([["A", "B"],["C", "D"]])

df.values[0][0] = 0

This isn't mixed-dtype. It has one specific dtype: object. However, object arrays can contain any Python object, so I can see why you would assume that it's of mixed types.
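A quick check of the dtypes makes this visible. In this sketch dtype=object is passed explicitly, which is an assumption added for the example: some pandas versions may otherwise infer a dedicated string dtype for all-string data.

```python
import pandas as pd

# dtype=object is an assumption for this example (see the note above).
df = pd.DataFrame([["A", "B"], ["C", "D"]], dtype=object)

print(df.dtypes)        # both columns are object
df.iloc[0, 0] = 0       # an int now sits next to strings
print(df.dtypes)        # still object: one dtype, several Python types
print(df.values.dtype)  # object
```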


Personal note:

Personally, I would have preferred that the values property only ever returned views, raising an error when it cannot return a view, with an additional method (e.g. as_array) that only ever returns copies, even when a view would be possible. That would make the behavior more predictable and avoid some surprises: a property that performs an expensive copy is certainly unexpected.
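For the "always a copy" half of that wish, pandas 0.24 and later provide DataFrame.to_numpy with a copy parameter. A minimal sketch with an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])

# copy=True asks for an array that does not share memory with the frame.
arr = df.to_numpy(copy=True)
arr[0, 0] = 99

print(arr[0, 0])      # 99
print(df.iloc[0, 0])  # 1, the DataFrame is untouched
```

Note that the default, copy=False, does not promise a view. It only avoids a copy where one is not needed.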


[1] This question has been mentioned in the issue post, so maybe the docs changed because of this question.

MSeifert answered Nov 15 '22 14:11
