Pandas seems to support using df.loc to assign a dictionary to a row entry, like the following:
df = pd.DataFrame(columns = ['a','b','c'])
entry = {'a':'test', 'b':1, 'c':float(2)}
df.loc[0] = entry
As expected, Pandas inserts the dictionary values to the corresponding columns based on the dictionary keys. Printing this gives:
a b c
0 test 1 2.0
However, if you overwrite the same entry, Pandas will assign the dictionary keys instead of the dictionary values. Printing this gives:
a b c
0 a b c
Why does this happen?
Specifically, why does this only happen on the second assignment? All subsequent assignments revert to the original result, containing (almost) the expected values:
a b c
0 test 1 2
I say almost because the dtype of c is actually object instead of float for all subsequent results.
I've determined that this happens whenever there is a string and a float involved. You won't find this behavior if it's just a string and integer, or integer and float.
df = pd.DataFrame(columns = ['a','b','c'])
print(f'empty df:\n{df}\n\n')
entry = {'a':'test', 'b':1, 'c':float(2.3)}
print(f'dictionary to be entered:\n{entry}\n\n')
df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')
df.loc[0] = entry
print(f'df after second entry:\n{df}\n\n')
df.loc[0] = entry
print(f'df after third entry:\n{df}\n\n')
df.loc[0] = entry
print(f'df after fourth entry:\n{df}\n\n')
This gives the following printout:
empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []
dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.0}
df after entry:
a b c
0 test 1 2.0
df after second entry:
a b c
0 a b c
df after third entry:
a b c
0 test 1 2
df after fourth entry:
a b c
0 test 1 2
empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []
dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}
df after entry:
a b c
0 test 1 2.3
df after second entry:
a b c
0 a b c
df after third entry:
a b c
0 a b c
df after fourth entry:
a b c
0 a b c
The first time df.loc[0] is assigned, the _setitem_with_indexer_missing function is run, since there is no index 0 on the axis.
This branch is run:
elif isinstance(value, dict):
value = Series(
value, index=self.obj.columns, name=indexer, dtype=object
)
This turns the dict into a Series, so it behaves as expected.
On later assignments, however, the index is no longer missing (index 0 now exists), so _setitem_with_indexer_split_path is run:
elif len(ilocs) == len(value):
# We are setting multiple columns in a single row.
for loc, v in zip(ilocs, value):
self._setitem_single_column(loc, v, pi)
This simply zips the column locations with each item produced by iterating over the dict.
In this case that's roughly equivalent to:
entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(list(zip([0, 1, 2], entry)))
# [(0, 'a'), (1, 'b'), (2, 'c')]
That is why the keys end up as the values.
For this reason, the problem isn't as specific as it may seem:
import pandas as pd
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
print(f'initial df:\n{df}\n\n')
entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
print(f'dictionary to be entered:\n{entry}\n\n')
df.loc[0] = entry
print(f'df after entry:\n{df}\n\n')
initial df:
a b c
0 1 2 3
dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}
df after entry:
a b c
0 a b c
If the index loc exists it will not convert to a series: it simply zips the column locs with the iterable. In the case of the dictionary this means the keys are the values that get included in the frame.
This is also likely the reason why only iterables whose iterators return their values are acceptable right-hand arguments to loc assignment.
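Until this is fixed, one way to sidestep the problem is to hand loc something that is aligned by label rather than iterated by position. Below is a minimal sketch, assuming that wrapping the dict in a Series before assigning is acceptable for your use case. The frame is created with dtype=object here purely so that the mixed values fit without any upcasting; the variable names are only for illustration.

```python
import pandas as pd

# Existing row, object dtype so strings, ints and floats can all be stored.
df = pd.DataFrame([['x', 0, 0.0]], columns=['a', 'b', 'c'], dtype=object)

# Key order deliberately differs from the column order.
entry = {'c': 2.3, 'a': 'test', 'b': 1}

# A Series carries the keys as its index, so pandas aligns it on the
# column labels instead of zipping positions with the dict's keys.
df.loc[0] = pd.Series(entry)

result = df.loc[0].tolist()
print(result)
```

Since the Series is matched on column labels, the order of the keys in the dict should not matter.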
I also concur with @DeepSpace that this should be raised as a bug.
The initial assignment is unchanged from 1.2.4; however, the dtypes are notable here:
import pandas as pd
df = pd.DataFrame({0: [1, 2, 3]}, columns=['a', 'b', 'c'])
entry = {'a': 'test', 'b': 1, 'c': float(2.3)}
# First Entry
df.loc[0] = entry
print(df.dtypes)
# a object
# b object
# c float64
# dtype: object
# Second Entry
df.loc[0] = entry
print(df.dtypes)
# a object
# b object
# c object
# dtype: object
# Third Entry
df.loc[0] = entry
print(df.dtypes)
# a object
# b object
# c object
# dtype: object
# Fourth Entry
df.loc[0] = entry
print(df.dtypes)
# a object
# b object
# c object
# dtype: object
They are notable because when take_split_path = self.obj._is_mixed_type is true, the same zip is performed as in 1.2.4.
However, in 1.1.5 the dtypes are all object after the second assignment, so take_split_path is true only for the assignment that follows the first one, when c is still float64. Subsequent assignments use:
if isinstance(value, (ABCSeries, dict)):
# TODO(EA): ExtensionBlock.setitem this causes issues with
# setting for extensionarrays that store dicts. Need to decide
# if it's worth supporting that.
value = self._align_series(indexer, Series(value))
Which, naturally, aligns the dict correctly.
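To get a feel for what "aligning" means here, the following sketch uses reindex on a Series built from the dict. This is only an approximation of the effect, not a copy of what _align_series does internally, and the missing 'b' key is added just to show that alignment goes by label.

```python
import pandas as pd

columns = pd.Index(['a', 'b', 'c'])

# Keys are out of order and 'b' is deliberately missing.
entry = {'c': 2.3, 'a': 'test'}

# Building a Series keeps the keys as labels; reindexing orders the values
# like the columns and fills labels that are absent with NaN.
aligned = pd.Series(entry).reindex(columns)

row = aligned.tolist()
print(row)
```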
Interesting find. On pandas version 1.2.4, all the subsequent dataframes have the value a b c, not just the second one.
empty df:
Empty DataFrame
Columns: [a, b, c]
Index: []
dictionary to be entered:
{'a': 'test', 'b': 1, 'c': 2.3}
df after entry:
a b c
0 test 1 2.3
df after second entry:
a b c
0 a b c
df after third entry:
a b c
0 a b c
Btw, it only seems to work correctly when assigning to a new row. So it's only associating the keys with the columns in that situation. For all subsequent re-assignments to existing rows, it has the observed unexpected behaviour in 1.2.4.
df.loc[1] = entry
print(f'df after assigning to a new row:\n{df}\n\n')
# output:
df after assigning to a new row:
a b c
0 a b c
1 test 1 2.3
df.loc[1] = entry
print(f'df after repeating:\n{df}\n')
# output:
df after repeating:
a b c
0 a b c
1 a b c
The reason it may be happening for existing rows (apart from being a bug) is that it's iterating over the collection. In the case of dictionaries, that means the keys. The docs section "Setting with enlargement" says:
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis. In the Series case this is effectively an appending operation.
So for new rows, it's "enlarging" the input but for existing rows, it's iterating over the input (keys for dicts, not values).
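The difference between the two cases can be mimicked with a toy model in plain Python. To be clear, this is not pandas code: assign_row and the rows dict are made up for illustration, and the sketch only assumes the behaviour described above (new rows align a dict by key, existing rows iterate over whatever they are given).

```python
def assign_row(rows, columns, label, value):
    """Toy model of the two cases described above (not pandas code)."""
    if label not in rows:
        # New row ("enlargement"): a dict is matched to the columns by key.
        if isinstance(value, dict):
            rows[label] = [value.get(col) for col in columns]
        else:
            rows[label] = list(value)
    else:
        # Existing row: the value is iterated position by position,
        # and iterating over a dict yields its keys.
        rows[label] = [v for _, v in zip(columns, value)]


columns = ['a', 'b', 'c']
entry = {'a': 'test', 'b': 1, 'c': 2.3}
rows = {}

assign_row(rows, columns, 0, entry)
first = list(rows[0])    # new row: values

assign_row(rows, columns, 0, entry)
second = list(rows[0])   # existing row: keys

assign_row(rows, columns, 0, list(entry.values()))
third = list(rows[0])    # existing row, list input: values

print(first, second, third)
```

In this model a list behaves the same in both cases because iterating over a list yields its values.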
For a list, it works as one would expect.
df.loc[2] = list(entry.values())
print(f'df when assigning from a list\n{df}\n')
# output
df when assigning from a list
a b c
0 a b c
1 a b c
2 test 1 2.3
df.loc[2] = list(entry.values())
print(f'df when assigning from a list 2nd time\n{df}\n')
# output
df when assigning from a list 2nd time
a b c
0 a b c
1 a b c
2 test 1 2.3
(That's the why based on the docs. I think the actual technical reason may only be apparent after perusing the source code.)
Imho, it should either work for all assignments/re-assignments or not be allowed at all. I agree that this should be raised as a bug, as @DeepSpace mentions.