
Assigning rank 2 numpy array to pandas DataFrame column behaves inconsistently

I’ve noticed that assigning to a pandas DataFrame column (using the .loc indexer) behaves differently depending on what other columns are present in the DataFrame and on the exact form of the assignment. Using three example DataFrames:

import numpy
import pandas

df1 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
})
#         col1
# 0  [1, 2, 3]
# 1  [4, 5, 6]
# 2  [7, 8, 9]
df2 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
})
#         col1          col2
# 0  [1, 2, 3]  [10, 20, 30]
# 1  [4, 5, 6]  [40, 50, 60]
# 2  [7, 8, 9]  [70, 80, 90]
df3 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [1, 2, 3]
})
#         col1  col2
# 0  [1, 2, 3]     1
# 1  [4, 5, 6]     2
# 2  [7, 8, 9]     3
x = numpy.array([[111, 222, 333],
                 [444, 555, 666],
                 [777, 888, 999]])

I’ve found the following:

  1. df1:

    1. df1.col1 = x

      Result:

      df1
      #    col1
      # 0   111
      # 1   444
      # 2   777
      
    2. df1.loc[:, 'col1'] = x

      Result:

      df1
      #    col1
      # 0   111
      # 1   444
      # 2   777
      
    3. df1.loc[0:2, 'col1'] = x

      Result:

      # […]
      # ValueError: could not broadcast input array from shape (3,3) into shape (3)
      
  2. df2:

    1. df2.col1 = x

      Result:

      df2
      #    col1          col2
      # 0   111  [10, 20, 30]
      # 1   444  [40, 50, 60]
      # 2   777  [70, 80, 90]
      
    2. df2.loc[:, 'col1'] = x

      Result:

      df2
      #    col1          col2
      # 0   111  [10, 20, 30]
      # 1   444  [40, 50, 60]
      # 2   777  [70, 80, 90]
      
    3. df2.loc[0:2, 'col1'] = x

      Result:

      # […]
      # ValueError: could not broadcast input array from shape (3,3) into shape (3)
      
  3. df3:

    1. df3.col1 = x

      Result:

      df3
      #    col1  col2
      # 0   111     1
      # 1   444     2
      # 2   777     3
      
    2. df3.loc[:, 'col1'] = x

      Result:

      # ValueError: Must have equal len keys and value when setting with an ndarray
      
    3. df3.loc[0:2, 'col1'] = x

      Result:

      # ValueError: Must have equal len keys and value when setting with an ndarray
      

So it seems that df.loc behaves differently when one of the other columns in the DataFrame does not have dtype object.
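For what it’s worth, comparing dtypes confirms that this is the only structural difference between the frames:

```python
import pandas

df2 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
})
df3 = pandas.DataFrame({
    'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    'col2': [1, 2, 3]
})

# df2 holds a single dtype (object) across all columns, while df3 mixes
# object and int64.
print(df2.dtypes.unique())  # one dtype
print(df3.dtypes.unique())  # two dtypes
```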

My question is:

  • Why would the presence of other columns make a difference in this kind of assignment?
  • Why are the different versions of the assignment not equivalent? In particular, why is it that, in the cases which don’t raise a ValueError, the DataFrame column ends up filled with the values of the first column of the numpy array?

Note: I’m not interested in discussing whether it makes sense to assign a column to a numpy array in this way. I only want to know about the differences in behavior, and whether this might count as a bug.

Socob asked Aug 29 '18 14:08


2 Answers

Why would the presence of other columns make a difference in this kind of assignment?

The simple answer is because Pandas checks for mixed types within a dataframe. You can check this for yourself using the same method used in the source code:

print(df1._is_mixed_type)  # False
print(df2._is_mixed_type)  # False
print(df3._is_mixed_type)  # True

The logic used differs based on the value of _is_mixed_type. Specifically, the following test in _setitem_with_indexer fails when _is_mixed_type is True for the inputs you have provided:

if len(labels) != value.shape[1]:
    raise ValueError('Must have equal len keys and value '
                     'when setting with an ndarray')

In other words, there are more columns in the array than there are columns to assign to in the dataframe.
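As a minimal sketch, you can trigger this mismatch deliberately on a small mixed-type frame. (The exact error message varies between pandas versions, but it is a ValueError either way.)

```python
import numpy as np
import pandas as pd

# int64 and float64 columns together make the frame mixed-type.
df = pd.DataFrame({'col1': [0, 0, 0], 'col2': [0.0, 0.0, 0.0]})
x = np.arange(9).reshape(3, 3)

# One column label selected, but the array has three columns, so the
# shape check in the setitem machinery raises.
try:
    df.loc[:, ['col1']] = x
    raised = False
except ValueError:
    raised = True
print(raised)
```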

Is this a bug? In my opinion, any use of lists or arrays in a Pandas dataframe is fraught with danger.1 The ValueError check was added to fix a more important issue (GH 7551).


Why are the different versions of the assignment not equivalent?

The reason assignment via df3['col1'] = x (or equivalently df3.col1 = x) works is that col1 is an existing series. Try df3['col3'] = x and your code will fail with a ValueError.
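If the goal really is to keep whole rows of the array inside a single column, one workaround (a sketch, not an endorsement, per the caveats below) is to hand pandas a plain list of rows, which sidesteps the ndarray checks entirely:

```python
import numpy as np
import pandas as pd

x = np.array([[111, 222, 333],
              [444, 555, 666],
              [777, 888, 999]])
df3 = pd.DataFrame({'col1': [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                    'col2': [1, 2, 3]})

# list(x) is a length-3 list of 1-D row arrays, so pandas treats it as
# one value per row rather than a 2-D ndarray to broadcast.
df3['col1'] = list(x)
print(df3['col1'].iloc[0])  # the first row of x, as a 1-D array
```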

Digging deeper, the __setitem__ method for a dataframe, for which df[] is syntactic sugar, converts the 'col1' label to a series (if it exists) via key = com._apply_if_callable(key, self):

def _apply_if_callable(maybe_callable, obj, **kwargs):
    """
    Evaluate possibly callable input using obj and kwargs if it is callable,
    otherwise return as it is
    """
    if callable(maybe_callable):
        return maybe_callable(obj, **kwargs)
    return maybe_callable

The logic can then sidestep the checking logic in _setitem_with_indexer. You can infer this because we jump to _setitem_array instead of _set_item when we provide a label for an existing series:

def __setitem__(self, key, value):

    key = com._apply_if_callable(key, self)

    if isinstance(key, (Series, np.ndarray, list, Index)):
        self._setitem_array(key, value)
    elif isinstance(key, DataFrame):
        self._setitem_frame(key, value)
    else:
        self._set_item(key, value)

All the above are implementation details; you should not base your Pandas syntax on these underlying methods, as they may change going forward.


1 I would go as far as to say it should be disabled by default and only enabled via a setting. It is a hugely inefficient way of storing and manipulating data. Sometimes it offers short-term convenience, but at the expense of obfuscated code down the line.

jpp answered Nov 15 '22 16:11


First, let me attempt a less technical and less rigorous version of @jpp's explanation. Generally speaking, when you attempt to insert a numpy array into a pandas dataframe, pandas expects them to have the same rank and dimensions (e.g. both are 4x2). A lower-rank array can also be OK if it broadcasts against the pandas selection: if the pandas selection is 4x2, for example, a 4x1 or 1x2 numpy array works (just read up on numpy broadcasting for more info).

The point of the preceding is simply that when you try to put a 3x3 numpy array into a pandas column of length 3 (basically 3x1), pandas doesn't really have a standard way to handle that, and the inconsistent behavior is simply a result of that. It would perhaps be better if pandas always raised an exception, but generally speaking pandas attempts to do something, though it just might not be something useful.
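To make the shape-compatibility point concrete, here is a sketch of assignments whose shapes do line up with the target selection, and which therefore behave predictably:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0], 'b': [0, 0, 0]})

# (3,) array into a single 3-row column: shapes agree.
df.loc[:, 'a'] = np.array([1, 2, 3])

# (3, 2) array into a 3x2 selection: shapes agree again.
df.loc[:, ['a', 'b']] = np.arange(6).reshape(3, 2)
print(df)
```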

Second, (and I realize this is not a literal answer) in the long run I can guarantee you'll be much much better off if you don't spend a lot of time working out the gory details of cramming two dimensional arrays into single pandas columns. Instead, just follow a more typical pandas approach like the following, which will produce code that (1) behaves more predictably, (2) is more readable, and (3) runs much faster.

import numpy as np
import pandas as pd

x = np.arange(1, 10).reshape(3, 3)
y = x * 10
z = x * 100

df = pd.DataFrame(np.hstack((x, y)), columns='x1 x2 x3 y1 y2 y3'.split())

#   x1 x2 x3  y1  y2  y3
# 0  1  2  3  10  20  30
# 1  4  5  6  40  50  60
# 2  7  8  9  70  80  90

df.loc[:,'x1':'x3'] = z

#     x1   x2   x3  y1  y2  y3
# 0  100  200  300  10  20  30
# 1  400  500  600  40  50  60
# 2  700  800  900  70  80  90

I kept this as a simple index, but it looks like what you may really be trying to do is set up a more hierarchical structure, and pandas can help there with a feature called a MultiIndex. In this case the result is cleaner syntax, but note that a MultiIndex can be more complicated to use in other cases (not worth going into the details here):

df = pd.DataFrame( np.hstack((x,y)), 
     columns=pd.MultiIndex.from_product( [list('xy'),list('123')] ) )

df.loc[:,'x'] = z       # now you can replace 'x1':'x3' with 'x'

And you probably know this, but it is also extremely easy to extract numpy arrays from dataframes, so you haven't lost anything just by putting the numpy array into multiple columns. For example, in the multi-index case:

df.loc[:,'x'].values

# array([[100, 200, 300],
#        [400, 500, 600],
#        [700, 800, 900]])
JohnE answered Nov 15 '22 14:11