(2/19/2019): I opened up a report in the numexpr tracker: https://github.com/pydata/numexpr/issues/331
The pandas report is: https://github.com/pandas-dev/pandas/issues/25369
Unless I'm doing something I'm not supposed to, the new nullable integer extension dtypes appear to have a bug with the DataFrame.query method (the problem seems to be in the numexpr package):
import numpy as np
import pandas as pd

df_test = pd.DataFrame(data=[4, 5, 6], columns=["col_test"])
df_test = df_test.astype(dtype={"col_test": pd.Int32Dtype()})
df_test.query("col_test != 6")  # fails with the ValueError shown below
Last lines of the long error message are:
  File "...\site_packages\numexpr\necompiler.py", line 822, in evaluate
    zip(names, arguments)]
  File "...\site_packages\numexpr\necompiler.py", line 821, in <listcomp>
    signature = [(name, getType(arg)) for (name, arg) in
  File "...\site_packages\numexpr\necompiler.py", line 703, in getType
    raise ValueError("unknown type %s" % a.dtype.name)
ValueError: unknown type object
The non-extension dtypes work fine:
df_test = df_test.astype(dtype={"col_test": np.int32})
df_test.query("col_test != 6")
(p.s. as an entirely separate issue, passing the dtype to the pd.DataFrame constructor directly doesn't work--seems buggy).
Thanks.
Extension dtypes were first introduced in 0.24, and there are still a lot of kinks to iron out.
This looks like a compatibility issue between numexpr and pandas: numexpr sees the extension column as an object array, which it does not know how to handle. Until that is fixed, you will have to fall back to the 'python' engine:
df_test.query('col_test != 6', engine='python')
   col_test
0         4
1         5
(More information on query/eval: Dynamic Expression Evaluation in pandas using pd.eval())
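If you call query in many places, one option is to wrap the fallback in a small helper. This is only a sketch: query_with_fallback is a hypothetical name, not a pandas function, and it assumes the failure always surfaces as the ValueError shown in the traceback above.

```python
import pandas as pd


def query_with_fallback(df, expr):
    """Hypothetical helper: try the default engine, then retry with engine='python'."""
    try:
        return df.query(expr)
    except ValueError:
        # Assumption: the default engine failed with the "unknown type object"
        # error from the report, so retry without numexpr.
        return df.query(expr, engine="python")


df_test = pd.DataFrame(data=[4, 5, 6], columns=["col_test"])
df_test = df_test.astype(dtype={"col_test": pd.Int32Dtype()})
result = query_with_fallback(df_test, "col_test != 6")
```

On a setup where the default engine handles the expression, the except branch is never taken, so the helper costs nothing extra there.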
That said, you could skip query altogether and filter with a boolean mask and loc:
df_test.loc[df_test['col_test'] != 6]
   col_test
0         4
1         5
This is likely to be a lot faster, since using engine='python' offers no performance benefit over loc.
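To check that the two workarounds are interchangeable for this case, here is a minimal sketch that runs both on the frame from the question and compares the results. The variable names via_query, via_loc and same are just for illustration.

```python
import pandas as pd

df_test = pd.DataFrame(data=[4, 5, 6], columns=["col_test"])
df_test = df_test.astype(dtype={"col_test": pd.Int32Dtype()})

# Workaround 1: keep query, but bypass numexpr.
via_query = df_test.query("col_test != 6", engine="python")

# Workaround 2: plain boolean mask with loc.
via_loc = df_test.loc[df_test["col_test"] != 6]

# Both should select the same rows and keep the nullable Int32 dtype.
same = via_query.equals(via_loc)
```

Note that with missing values in the column, the mask itself would contain missing entries, so it is worth testing that case separately before relying on either form.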