Bug in pandas query() method?

I was experimenting with several use cases for the pandas query() method, and tried one argument that threw an exception yet still caused an unwanted modification to the data in my DataFrame.

In [549]: syn_fmax_sort
Out[549]: 
     build_number      name    fmax
0             390     adpcm  143.45
1             390       aes  309.60
2             390     dfadd  241.02
3             390     dfdiv   10.80
....
211           413     dfmul  215.98
212           413     dfsin   11.94
213           413       gsm  194.70
214           413      jpeg  197.75
215           413      mips  202.39
216           413     mpeg2  291.29
217           413       sha  243.19

[218 rows x 3 columns]

I wanted to use query() to extract the subset of this DataFrame where build_number is 392, so I tried:

In [550]: syn_fmax_sort.query('build_number = 392')

That threw a "ValueError: cannot label index with a null key" exception. Not only that, it also left me with the full DataFrame and caused every build_number to be set to 392:

In [551]: syn_fmax_sort
Out[551]: 
     build_number      name    fmax
0             392     adpcm  143.45
1             392       aes  309.60
2             392     dfadd  241.02
3             392     dfdiv   10.80
....
211           392     dfmul  215.98
212           392     dfsin   11.94
213           392       gsm  194.70
214           392      jpeg  197.75
215           392      mips  202.39
216           392     mpeg2  291.29
217           392       sha  243.19

[218 rows x 3 columns]

I have since figured out how to get only the rows with the value 392: if I use syn_fmax_sort.query('391 < build_number < 393'), it works.

So my question is: is the behavior I observed above, when I wrote the query incorrectly, due to a bug in the query() method?

Asked by AKKO, Feb 25 '15 at 08:02


1 Answer

It looks like you had a typo: you probably wanted to use == rather than =. A simple example shows the same problem:

In [286]:

df = pd.DataFrame({'a':np.arange(5)})
df
Out[286]:
   a
0  0
1  1
2  2
3  3
4  4
In [287]:

df.query('a = 3')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-287-41cfa0572737> in <module>()
----> 1 df.query('a = 3')

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in query(self, expr, **kwargs)
   1923             # when res is multi-dimensional loc raises, but this is sometimes a
   1924             # valid query
-> 1925             return self[res]
   1926 
   1927     def eval(self, expr, **kwargs):

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   1778             return self._getitem_multilevel(key)
   1779         else:
-> 1780             return self._getitem_column(key)
   1781 
   1782     def _getitem_column(self, key):

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\frame.py in _getitem_column(self, key)
   1785         # get column
   1786         if self.columns.is_unique:
-> 1787             return self._get_item_cache(key)
   1788 
   1789         # duplicate columns & possible reduce dimensionaility

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\generic.py in _get_item_cache(self, item)
   1066         res = cache.get(item)
   1067         if res is None:
-> 1068             values = self._data.get(item)
   1069             res = self._box_item_values(item, values)
   1070             cache[item] = res

C:\WinPython-64bit-3.4.2.4\python-3.4.2.amd64\lib\site-packages\pandas\core\internals.py in get(self, item, fastpath)
   2856                         loc = indexer.item()
   2857                     else:
-> 2858                         raise ValueError("cannot label index with a null key")
   2859 
   2860             return self.iget(loc, fastpath=fastpath)

ValueError: cannot label index with a null key

It looks like, internally, it is trying to build an index from the result of your query; it then checks the length and, as it is 0, raises a ValueError (it probably should be a KeyError). I don't know exactly how your query was evaluated, but the = appears to be treated as an assignment to the column, which would explain why build_number was overwritten. Perhaps assigning values to columns this way is simply not supported at the moment.
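For comparison, here is a minimal sketch of the query written with ==. The small frame below is a made-up stand-in for syn_fmax_sort (the column names come from the question, the rows and values are invented for illustration), so treat it as a sketch rather than a reproduction of your data. It also checks that the original frame is left untouched and that the chained comparison from the question selects the same rows.

```python
import pandas as pd

# Hypothetical stand-in for syn_fmax_sort; values are invented for illustration.
df = pd.DataFrame({
    "build_number": [390, 390, 392, 392, 413],
    "name": ["adpcm", "aes", "adpcm", "aes", "sha"],
    "fmax": [143.45, 309.60, 150.00, 310.10, 243.19],
})
original = df.copy()  # snapshot, to confirm that querying does not modify df

# Comparison (==), not assignment (=)
subset = df.query("build_number == 392")

# Equivalent boolean-mask selection without query()
mask_subset = df[df["build_number"] == 392]

# The chained comparison from the question selects the same rows
# for an integer column
chained = df.query("391 < build_number < 393")

print(subset)
```

On an integer column all three selections should return the same rows, and df itself should be unchanged afterwards.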

Answered by EdChum, Oct 10 '22 at 06:10