I’ve noticed that assigning to a pandas DataFrame column (using the .loc indexer) behaves differently depending on what other columns are present in the DataFrame and on the exact form of the assignment. Using three example DataFrames:
import numpy
import pandas

df1 = pandas.DataFrame({
'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
})
# col1
# 0 [1, 2, 3]
# 1 [4, 5, 6]
# 2 [7, 8, 9]
df2 = pandas.DataFrame({
'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
})
# col1 col2
# 0 [1, 2, 3] [10, 20, 30]
# 1 [4, 5, 6] [40, 50, 60]
# 2 [7, 8, 9] [70, 80, 90]
df3 = pandas.DataFrame({
'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
'col2': [1, 2, 3]
})
# col1 col2
# 0 [1, 2, 3] 1
# 1 [4, 5, 6] 2
# 2 [7, 8, 9] 3
x = numpy.array([[111, 222, 333],
[444, 555, 666],
[777, 888, 999]])
I’ve found the following. For df1:
df1.col1 = x
Result:
df1
# col1
# 0 111
# 1 444
# 2 777
df1.loc[:, 'col1'] = x
Result:
df1
# col1
# 0 111
# 1 444
# 2 777
df1.loc[0:2, 'col1'] = x
Result:
# […]
# ValueError: could not broadcast input array from shape (3,3) into shape (3)
For df2:
df2.col1 = x
Result:
df2
# col1 col2
# 0 111 [10, 20, 30]
# 1 444 [40, 50, 60]
# 2 777 [70, 80, 90]
df2.loc[:, 'col1'] = x
Result:
df2
# col1 col2
# 0 111 [10, 20, 30]
# 1 444 [40, 50, 60]
# 2 777 [70, 80, 90]
df2.loc[0:2, 'col1'] = x
Result:
# […]
# ValueError: could not broadcast input array from shape (3,3) into shape (3)
For df3:
df3.col1 = x
Result:
df3
# col1 col2
# 0 111 1
# 1 444 2
# 2 777 3
df3.loc[:, 'col1'] = x
Result:
# ValueError: Must have equal len keys and value when setting with an ndarray
df3.loc[0:2, 'col1'] = x
Result:
# ValueError: Must have equal len keys and value when setting with an ndarray
So it seems that df.loc behaves differently if one of the other columns in the DataFrame does not have dtype object.
My question is: why do some of these assignments raise a ValueError, while others silently succeed, with the effect that the DataFrame column is filled with the values of the first column of the numpy array?
Note: I’m not interested in discussing whether it makes sense to assign a numpy array to a column in this way. I only want to know about the differences in behavior, and whether this might count as a bug.
Why would the presence of other columns make a difference in this kind of assignment?
The simple answer is because Pandas checks for mixed types within a dataframe. You can check this for yourself using the same method used in the source code:
print(df1._is_mixed_type) # False
print(df2._is_mixed_type) # False
print(df3._is_mixed_type) # True
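The same distinction is visible through the public API; as a sketch (using dtypes rather than the private _is_mixed_type attribute, which may change between versions):

```python
import pandas as pd

# df2: both columns hold Python lists, so both have dtype object (homogeneous)
df2 = pd.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
})

# df3: col1 is object but col2 is integer, so the frame holds mixed dtypes
df3 = pd.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [1, 2, 3],
})

print(df2.dtypes.nunique())  # 1 -> a single dtype across columns
print(df3.dtypes.nunique())  # 2 -> mixed dtypes
```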
The logic used differs based on the value of _is_mixed_type. Specifically, the following test in _setitem_with_indexer fails when _is_mixed_type is True for the inputs you have provided:
if len(labels) != value.shape[1]:
raise ValueError('Must have equal len keys and value '
'when setting with an ndarray')
In other words, there are more columns in the array than there are columns to assign to in the dataframe.
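As an illustrative sketch of that condition (not from the original post): if you select as many column labels as the array has columns, the lengths match and the same .loc assignment goes through.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [1, 2, 3],
})
x = np.array([[111, 222, 333],
              [444, 555, 666],
              [777, 888, 999]])

# One label vs. three array columns: len(labels) != value.shape[1] -> ValueError.
# Two labels vs. a (3, 2) slice: the lengths match, so the assignment succeeds.
df3.loc[:, ['col1', 'col2']] = x[:, :2]
print(df3['col1'].tolist())  # [111, 444, 777]
```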
Is this a bug? In my opinion, any use of lists or arrays in a Pandas dataframe is fraught with danger.1 The ValueError check was added to fix a more important issue (GH 7551).
Why are the different versions of the assignment not equivalent?
The reason why assignment via df3['col1'] = x works is because col1 is an existing series. Try df3['col3'] = x and your code will fail with ValueError.
Digging deeper: the __setitem__ method for a dataframe, for which df[] is syntactic sugar, first passes the 'col1' key through key = com._apply_if_callable(key, self):
def _apply_if_callable(maybe_callable, obj, **kwargs):
"""
Evaluate possibly callable input using obj and kwargs if it is callable,
otherwise return as it is
"""
if callable(maybe_callable):
return maybe_callable(obj, **kwargs)
return maybe_callable
The logic can then sidestep the checking logic in _setitem_with_indexer. You can infer this because we jump to _setitem_array instead of _set_item when we provide a label for an existing series:
def __setitem__(self, key, value):
key = com._apply_if_callable(key, self)
if isinstance(key, (Series, np.ndarray, list, Index)):
self._setitem_array(key, value)
elif isinstance(key, DataFrame):
self._setitem_frame(key, value)
else:
self._set_item(key, value)
All the above are implementation details; you should not base your Pandas syntax on these underlying methods, as they may change going forward.
1 I would go as far as to say it should be disabled by default and only enabled via a setting. It is a hugely inefficient way of storing and manipulating data. Sometimes it offers short-term convenience, but at the expense of obfuscated code down the line.
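To illustrate the footnote with a small hypothetical comparison: a column of lists is stored as dtype object (a column of pointers to Python objects), whereas spreading the same values across numeric columns keeps a vectorisable dtype.

```python
import pandas as pd

# One column of Python list objects: pandas can only store pointers (dtype object)
lists = pd.DataFrame({'v': [[1, 2, 3], [4, 5, 6]]})

# The same data as three numeric columns: contiguous, vectorisable storage
cols = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['v1', 'v2', 'v3'])

print(lists['v'].dtype)       # object
print(cols['v1'].dtype.kind)  # 'i' -> integer dtype

doubled = cols * 2            # element-wise arithmetic works directly
# (lists['v'] * 2 would instead repeat each Python list: [1, 2, 3, 1, 2, 3])
```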
First, let me attempt a less technical and less rigorous version of @jpp's explanation. Generally speaking, when you attempt to insert a numpy array into a pandas dataframe, pandas expects it to have the same rank and dimensions as the target (e.g. both are 4x2). It can also be OK for the rank of the numpy array to be lower than that of the pandas selection: for example, a 4x2 pandas selection can accept a 4x1 or 1x2 numpy array (read up on numpy broadcasting for more info).
The point of the preceding is simply that when you try to put a 3x3 numpy array into a pandas column of length 3 (basically 3x1), pandas doesn't really have a standard way to handle that, and the inconsistent behavior is simply a result of that. It would perhaps be better if pandas always raised an exception, but generally speaking pandas attempts to do something, but it just might not be something useful.
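A hypothetical sketch of those shape rules: values that match the selection, or broadcast to it, assign cleanly; it is only the wider-than-the-target case that has no standard interpretation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

df.loc[:, 'a'] = np.array([10, 20, 30])  # (3,) into a length-3 column: fine
df.loc[:, 'a'] = 99                      # scalar broadcasts over the column: fine
print(df['a'].tolist())  # [99, 99, 99]
```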
Second, (and I realize this is not a literal answer) in the long run I can guarantee you'll be much, much better off if you don't spend a lot of time working out the gory details of cramming two-dimensional arrays into single pandas columns. Instead, follow a more typical pandas approach like the following, which produces code that (1) behaves more predictably, (2) is more readable, and (3) runs much faster.
import numpy as np
import pandas as pd

x = np.arange(1, 10).reshape(3, 3)
y = x * 10
z = x * 100
df = pd.DataFrame(np.hstack((x, y)), columns='x1 x2 x3 y1 y2 y3'.split())
# x1 x2 x3 y1 y2 y3
# 0 1 2 3 10 20 30
# 1 4 5 6 40 50 60
# 2 7 8 9 70 80 90
df.loc[:,'x1':'x3'] = z
# x1 x2 x3 y1 y2 y3
# 0 100 200 300 10 20 30
# 1 400 500 600 40 50 60
# 2 700 800 900 70 80 90
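For existing columns, plain __setitem__ with a list of labels is, as far as I know, an equivalent way to write the same replacement:

```python
import numpy as np
import pandas as pd

x = np.arange(1, 10).reshape(3, 3)
z = x * 100
df = pd.DataFrame(x, columns=['x1', 'x2', 'x3'])

# Assigning a (3, 3) array to three existing columns: the shapes line up
df[['x1', 'x2', 'x3']] = z
print(df['x1'].tolist())  # [100, 400, 700]
```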
I kept this as a simple index, but it looks like what you are perhaps trying to do is set up a more hierarchical structure, and pandas can help there with a feature called a MultiIndex. In this case the result is cleaner syntax, but note that a MultiIndex can be more complicated to use in other cases (not worth going into the details here):
df = pd.DataFrame( np.hstack((x,y)),
columns=pd.MultiIndex.from_product( [list('xy'),list('123')] ) )
df.loc[:,'x'] = z # now you can replace 'x1':'x3' with 'x'
And you probably know this, but it is also extremely easy to extract numpy arrays from dataframes, so you haven't lost anything just by putting the numpy array into multiple columns. For example, in the multi-index case:
df.loc[:,'x'].values
# array([[100, 200, 300],
# [400, 500, 600],
# [700, 800, 900]])