
Pandas DataFrame Assignment Bug using Dictionaries of Strings and Floats?

Problem

Pandas seems to support using df.loc to assign a dictionary to a row entry, like the following:

import pandas as pd

df = pd.DataFrame(columns = ['a','b','c'])
entry = {'a':'test', 'b':1, 'c':float(2)}
df.loc[0] = entry

As expected, Pandas inserts the dictionary values to the corresponding columns based on the dictionary keys. Printing this gives:

      a  b    c
0  test  1  2.0

However, if you overwrite the same entry, Pandas will assign the dictionary keys instead of the dictionary values. Printing this gives:

   a  b  c
0  a  b  c

Question

Why does this happen?

Specifically, why does this only happen on the second assignment? All subsequent assignments revert to the original result, containing (almost) the expected values:

      a  b  c
0  test  1  2

I say almost because the dtype of c is actually object instead of float for all subsequent results.


I've determined that this happens whenever there is a string and a float involved. You won't find this behavior if it's just a string and integer, or integer and float.

Example Code

import pandas as pd

df = pd.DataFrame(columns = ['a','b','c'])
print(f'empty df:\n{df}\n\n')

entry = {'a':'test', 'b':1, 'c':float(2.3)}
print(f'dictionary to be entered:\n{entry}\n\n')

df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after second entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after third entry:\n{df}\n\n')

df.loc[0] = entry
print(f'df after fourth entry:\n{df}\n\n')

This gives the following printout:

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []


dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.0}


df after entry:
      a  b    c
0  test  1  2.0


df after second entry:
   a  b  c
0  a  b  c


df after third entry:
      a  b  c
0  test  1  2


df after fourth entry:
      a  b  c
0  test  1  2

asked May 20 '21 18:05 by ThatNewGuy


2 Answers

The 1.2.4 behaviour is as follows:

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []


dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}


df after entry:
      a  b    c
0  test  1  2.3


df after second entry:
   a  b  c
0  a  b  c


df after third entry:
   a  b  c
0  a  b  c


df after fourth entry:
   a  b  c
0  a  b  c

The first time df.loc[0] = entry is run, the _setitem_with_indexer_missing function is called, since there is no index 0 on the axis.

This branch is run:

elif isinstance(value, dict):
    value = Series(
        value, index=self.obj.columns, name=indexer, dtype=object
    )

This turns the dict into a Series indexed by the columns, so it behaves as expected.


On subsequent assignments, however, the index is no longer missing (index 0 now exists), so _setitem_with_indexer_split_path is run:

elif len(ilocs) == len(value):
    # We are setting multiple columns in a single row.
    for loc, v in zip(ilocs, value):
        self._setitem_single_column(loc, v, pi)

This just zips the column locations with each item yielded by iterating the dict, and iterating a dict yields its keys.

In this case that's roughly equivalent to:

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(list(zip([0, 1, 2], entry)))
# [(0, 'a'), (1, 'b'), (2, 'c')]

This is why the keys end up in the frame as the values.
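The difference between the two code paths can be sketched without pandas. This is a simplified illustration under my own assumptions, not pandas' actual implementation; the helper names assign_by_label and assign_by_position are hypothetical.

```python
columns = ['a', 'b', 'c']
entry = {'a': 'test', 'b': 1, 'c': 2.3}


def assign_by_label(columns, value):
    """Mimic the new-row path: look each column up by its key."""
    return {col: value.get(col) for col in columns}


def assign_by_position(columns, value):
    """Mimic the existing-row path: pair columns with iter(value)."""
    # Iterating a dict yields its keys, so the keys become the cell values.
    return dict(zip(columns, value))


print(assign_by_label(columns, entry))
# {'a': 'test', 'b': 1, 'c': 2.3}
print(assign_by_position(columns, entry))
# {'a': 'a', 'b': 'b', 'c': 'c'}
```

Passing list(entry.values()) to assign_by_position gives the expected row again, which is consistent with lists working on both paths.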


For this reason, the problem isn't as specific as it may seem:

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
print(f'initial df:\n{df}\n\n')

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(f'dictionary to be entered:\n{entry}\n\n')

df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')

This prints:

initial df:
   a  b  c
0  1  2  3

dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}

df after entry:
   a  b  c
0  a  b  c

If the index loc exists it will not convert to a series: it simply zips the column locs with the iterable. In the case of the dictionary this means the keys are the values that get included in the frame.

This is also likely the reason why only iterables whose iterators return their values are acceptable right-hand arguments to loc assignment.


I also concur with @DeepSpace that this should be raised as a bug.


The 1.1.5 behaviour is as follows:

The initial assignment is unchanged from 1.2.4; however, the dtypes are notable here:

import pandas as pd

df = pd.DataFrame({0: [1, 2, 3]}, columns=['a', 'b', 'c'])

entry = {'a': 'test', 'b': 1, 'c': float(2.3)}

# First Entry
df.loc[0] = entry
print(df.dtypes)
# a     object
# b     object
# c    float64
# dtype: object

# Second Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

# Third Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

# Fourth Entry
df.loc[0] = entry
print(df.dtypes)
# a    object
# b    object
# c    object
# dtype: object

They are notable because when

take_split_path = self.obj._is_mixed_type

is true, pandas does the same zip that it does in 1.2.4.

However, in 1.1.5 take_split_path is only true right after the first assignment, while c is still float64. The second assignment leaves every dtype as object, so the frame is no longer mixed and take_split_path is false. Subsequent assignments use:

if isinstance(value, (ABCSeries, dict)):
    # TODO(EA): ExtensionBlock.setitem this causes issues with
    # setting for extensionarrays that store dicts. Need to decide
    # if it's worth supporting that.
    value = self._align_series(indexer, Series(value))

Which, naturally, aligns the dict correctly.
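The 1.1.5 sequence can be traced with a toy model. This is only a sketch of the reasoning above: takes_split_path is a hypothetical stand-in that treats "mixed type" as "more than one distinct dtype", and the dtype lists are copied from the printouts above rather than computed by pandas.

```python
def takes_split_path(dtypes):
    """Toy stand-in for _is_mixed_type: True when column dtypes differ."""
    return len(set(dtypes)) > 1


def simulate(n_assignments):
    """Return the path each re-assignment to an existing row would take."""
    # dtypes reported above after the first (enlarging) assignment
    dtypes = ['object', 'object', 'float64']
    paths = []
    for _ in range(n_assignments):
        if takes_split_path(dtypes):
            paths.append('zip (keys)')
            # Writing string keys into every column leaves all columns as object
            dtypes = ['object'] * len(dtypes)
        else:
            paths.append('align (values)')
    return paths


# Second, third and fourth assignments from the question
print(simulate(3))
# ['zip (keys)', 'align (values)', 'align (values)']
```

Under these assumptions only the second assignment takes the zip path, which matches the printout in the question.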

answered Nov 14 '22 19:11 by Henry Ecker


Interesting find. On pandas version 1.2.4, all the subsequent dataframes have the value a b c, not just the second one.

empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []

dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}

df after entry:
      a  b    c
0  test  1  2.3

df after second entry:
   a  b  c
0  a  b  c

df after third entry:
   a  b  c
0  a  b  c

Btw, it only seems to work correctly when assigning to a new row; only in that situation does it associate the keys with the columns. For every subsequent re-assignment to an existing row, 1.2.4 shows the unexpected behaviour.

df.loc[1] = entry
print(f'df after assigning to a new row:\n{df}\n\n')
# output:
df after assigning to a new row:
      a  b    c
0     a  b    c
1  test  1  2.3

df.loc[1] = entry
print(f'df after repeating:\n{df}\n')
# output:
df after repeating:
   a  b  c
0  a  b  c
1  a  b  c

The reason it may be happening for existing rows (apart from being a bug) is that it's iterating over the collection, which for a dictionary means its keys. The docs section "Setting with enlargement" says:

The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.

So for new rows, it's "enlarging" the input but for existing rows, it's iterating over the input (keys for dicts, not values).

For a list, it works as one would expect.

df.loc[2] = list(entry.values())
print(f'df when assigning from a list\n{df}\n')
# output
df when assigning from a list
      a  b    c
0     a  b    c
1     a  b    c
2  test  1  2.3


df.loc[2] = list(entry.values())
print(f'df when assigning from a list 2nd time\n{df}\n')
# output
df when assigning from a list 2nd time
      a  b    c
0     a  b    c
1     a  b    c
2  test  1  2.3
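Since a list is consumed by position, list(entry.values()) only lines up when the dict happens to be in column order. A small helper can make that ordering explicit. This is a sketch; row_values is a hypothetical name, not a pandas function.

```python
def row_values(entry, columns):
    """Return entry's values ordered to match columns (hypothetical helper)."""
    missing = [col for col in columns if col not in entry]
    if missing:
        raise KeyError(f'entry has no value for columns: {missing}')
    return [entry[col] for col in columns]


columns = ['a', 'b', 'c']
entry = {'c': 2.3, 'a': 'test', 'b': 1}  # deliberately out of column order
print(row_values(entry, columns))
# ['test', 1, 2.3]
```

With a DataFrame this would presumably be used as df.loc[0] = row_values(entry, df.columns), which should behave like the list assignments shown above.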

(That's the why based on the docs. I think the actual technical reason may only be apparent after perusing the source code.)

Imho, it should either work for all assignments/re-assignments or not be allowed at all. I agree that this should be raised as a bug, as @DeepSpace mentions.

answered Nov 14 '22 21:11 by aneroid