I was experimenting with several use cases for the pandas query() method, and tried one argument that threw an exception yet still caused an unwanted modification to the data in my DataFrame.
In [549]: syn_fmax_sort
Out[549]:
build_number name fmax
0 390 adpcm 143.45
1 390 aes 309.60
2 390 dfadd 241.02
3 390 dfdiv 10.80
....
211 413 dfmul 215.98
212 413 dfsin 11.94
213 413 gsm 194.70
214 413 jpeg 197.75
215 413 mips 202.39
216 413 mpeg2 291.29
217 413 sha 243.19
[218 rows x 3 columns]
So I wanted to use query() to take out just the subset of this dataframe where build_number is 392, so I tried:
In [550]: syn_fmax_sort.query('build_number = 392')
That threw a ValueError: cannot label index with a null key exception. Not only that, it also returned the full dataframe to me and caused every build_number to be set to 392:
In [551]: syn_fmax_sort
Out[551]:
build_number name fmax
0 392 adpcm 143.45
1 392 aes 309.60
2 392 dfadd 241.02
3 392 dfdiv 10.80
....
211 392 dfmul 215.98
212 392 dfsin 11.94
213 392 gsm 194.70
214 392 jpeg 197.75
215 392 mips 202.39
216 392 mpeg2 291.29
217 392 sha 243.19
[218 rows x 3 columns]
However, I have since figured out how to get only the rows with value 392: if I use syn_fmax_sort.query('391 < build_number < 393'), it works.
So my question is: is the behavior I observed above, when I queried the dataframe wrongly, due to a bug in the query() method?
It looks like you had a typo: you probably wanted to use == rather than =. A simple example shows the same problem:
In [286]:
df = pd.DataFrame({'a':np.arange(5)})
df
Out[286]:
a
0 0
1 1
2 2
3 3
4 4
In [287]:
df.query('a = 3')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-287-41cfa0572737> in <module>()
----> 1 df.query('a = 3')
C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in query(self, expr, **kwargs)
1923 # when res is multi-dimensional loc raises, but this is sometimes a
1924 # valid query
-> 1925 return self[res]
1926
1927 def eval(self, expr, **kwargs):
C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
1778 return self._getitem_multilevel(key)
1779 else:
-> 1780 return self._getitem_column(key)
1781
1782 def _getitem_column(self, key):
C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
1785 # get column
1786 if self.columns.is_unique:
-> 1787 return self._get_item_cache(key)
1788
1789 # duplicate columns & possible reduce dimensionaility
C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
1066 res = cache.get(item)
1067 if res is None:
-> 1068 values = self._data.get(item)
1069 res = self._box_item_values(item, values)
1070 cache[item] = res
C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
2856 loc = indexer.item()
2857 else:
-> 2858 raise ValueError("cannot label index with a null key")
2859
2860 return self.iget(loc, fastpath=fastpath)
ValueError: cannot label index with a null key
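For comparison, here is a minimal sketch of the same toy frame queried with == instead of =. It assumes a reasonably recent pandas, and the variable names (matches, mask_matches) are only there for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5)})

# '==' is a comparison, so query() returns only the matching rows
matches = df.query('a == 3')

# equivalent boolean-mask indexing, without an expression string
mask_matches = df[df['a'] == 3]

# inspect the result and confirm the original column is unchanged
print(matches)
print(df['a'].tolist())
```

The boolean-mask form is a reasonable fallback if you would rather avoid expression strings altogether, since a mistyped operator there is an ordinary Python syntax error.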
It looks like internally it's trying to build an index using the result of your query; it then checks the length and, as it's 0, raises a ValueError (it probably should be a KeyError). I don't know how it evaluated your query, but perhaps the ability to assign values to columns this way is unsupported at the moment.
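One possible explanation for the modified column, offered as an assumption rather than something verified against the pandas version in the traceback: query() hands its string to eval(), and in eval() a single = means assignment. The sketch below only demonstrates that eval() behaviour on a toy frame (the column name b is made up for illustration); it does not reproduce the old query() code path.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(5)})

# In eval(), 'name = expression' assigns a column. With inplace=False
# a new frame is returned and df itself is left unchanged.
with_b = df.eval('b = a + 1', inplace=False)

print(with_b)
print(list(df.columns))
```

If that is what happened here, the assignment ran first and the exception came afterwards, when query() tried to use the result as a row selector.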