With Pandas 1.0.1, I'm unable to merge if the
df = df.merge(df2, on=some_column)
yields
File /home/torstein/code/fintechdb/Sheets/sheets/gild.py, line 42, in gild
df = df.merge(df2, on=some_column)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py, line 7297, in merge
validate=validate,
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 88, in merge
return op.get_result()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 643, in get_result
join_index, left_indexer, right_indexer = self._get_join_info()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 862, in _get_join_info
(left_indexer, right_indexer) = self._get_join_indexers()
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 841, in _get_join_indexers
self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1311, in _get_join_indexers
zipped = zip(*mapped)
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1309, in <genexpr>
for n in range(len(left_keys))
File /home/torstein/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/merge.py, line 1918, in _factorize_keys
rlab = rizer.factorize(rk)
File pandas/_libs/hashtable.pyx, line 77, in pandas._libs.hashtable.Factorizer.factorize
File pandas/_libs/hashtable_class_helper.pxi, line 1817, in pandas._libs.hashtable.PyObjectHashTable.get_labels
File pandas/_libs/hashtable_class_helper.pxi, line 1732, in pandas._libs.hashtable.PyObjectHashTable._unique
File pandas/_libs/missing.pyx, line 360, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous
while this works:
df[some_column].fillna(np.nan, inplace=True)
df2[some_column].fillna(np.nan, inplace=True)
df = df.merge(df2, on=some_column)
# Works
If instead, I do
df[some_column].fillna(pd.NA, inplace=True)
then the error returns.
This has to do with pd.NA
being implemented in pandas 1.0.0 and how the pandas team decided it should work in a boolean context. Also, you take into account it is an experimental feature, hence it shouldn't be used for anything but experimenting:
Warning Experimental: the behaviour of pd.NA can still change without warning.
In another link of pandas documentation, where it covers working with missing values, is where I believe the reason and the answer you are looking for can be found:
NA in a boolean context: Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value. The following raises an error: TypeError: boolean value of NA is ambiguous
Furthermore, it provides a valuable piece of advise:
"This also means that pd.NA cannot be used in a context where it is evaluated to a boolean, such as if condition: ... where condition can potentially be pd.NA. In such cases, isna() can be used to check for pd.NA or condition being pd.NA can be avoided, for example by filling missing values beforehand."
I decided that the pd.NA
instances in my data were valid, and hence I needed to deal with them rather than filling them, like with fillna()
. If you're like me in this case, then convert it from pd.NA
to either True
or False
by simply using pd.isna(val)
. Only you can decide whether the null should come out T or F, but here's a simple example:
val = pd.NA
if pd.isna(val) :
print('it is null')
else :
print('it is not null')
returns: it is null
Then,
val = 7
if pd.isna(val) :
print('it is null')
else :
print('it is not null')
returns: it is not null
Hope this helps other trying to get a definitive course of action (Celius's answer is accurate, but I wanted to provide actionable code for those struggling with this).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With