
pandas DataFrame.query expression that returns all rows by default

I have discovered the pandas DataFrame.query method, and it does almost exactly what I need. (I had implemented my own parser for this because I hadn't realized the method existed, but I really should be using the standard one.)

I would like my users to be able to specify the query in a configuration file. The syntax seems intuitive enough that I can expect my non-programmer (but engineer) users to figure it out.

There's just one thing missing: a way to select everything in the dataframe. Sometimes what my users want to use is every row, so they would put 'All' or something into that configuration option. In fact, that will be the default option.

I tried df.query('True') but that raised a KeyError. I tried df.query('1') but that returned the row with index 1. The empty string raised a ValueError.

The only things I can think of are 1) put an if clause every time I need to do this type of query (probably 3 or 4 times in the code) or 2) subclass DataFrame and either reimplement query, or add a query_with_all method:

import pandas as pd

class MyDataFrame(pd.DataFrame):
    def query_with_all(self, query_string):
        if query_string.lower() == 'all':
            return self
        else:
            return self.query(query_string)

And then use my own class every time instead of the pandas one. Is this the only way to do this?

asked Oct 19 '17 03:10 by moink



2 Answers

Keep things simple, and use a function:

def query_with_all(data_frame, query_string):
    # Accept "all", "All", "ALL", etc., as the question's users may type any of them.
    if query_string.strip().lower() == "all":
        return data_frame
    return data_frame.query(query_string)

Whenever you need this type of query, just call the function with the data frame and the query string. There's no need for extra if statements at each call site or for subclassing pd.DataFrame.
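As a rough sketch of how this could be wired to a configuration value: the `config` dict, its `row_filter` key, and the `pressure` column below are hypothetical names invented for illustration, and the helper is restated with a case-insensitive check so the block runs on its own.

```python
import pandas as pd


def query_with_all(data_frame, query_string="all"):
    """Return every row for 'all' (any casing), else delegate to DataFrame.query."""
    if query_string.strip().lower() == "all":
        return data_frame
    return data_frame.query(query_string)


# Hypothetical stand-in for whatever the real configuration file parses into.
config = {"row_filter": "All"}

df = pd.DataFrame({"pressure": [1.0, 2.5, 4.0]})

# Default case: the user left the option at "All", so nothing is filtered out.
everything = query_with_all(df, config.get("row_filter", "all"))

# A user-supplied expression is passed straight through to DataFrame.query.
subset = query_with_all(df, "pressure > 2")
```

Keeping the special case in one function means the three or four call sites mentioned in the question stay free of their own if clauses.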


If you're restricted to using df.query, you can use a global variable:

ALL = slice(None)
df.query('@ALL', engine='python')

If you're not allowed to use global variables, and if your DataFrame isn't MultiIndexed, you can use

df.query('tuple()')

All of these will properly handle NaN values.

answered Oct 20 '22 00:10 by Joshua


df.query('ilevel_0 in ilevel_0') will always return the full dataframe, even when the index contains NaN values or the dataframe is completely empty.

In your particular case you could then define a global variable all_true = 'ilevel_0 in ilevel_0' (as suggested in the comments by Zero) so that your engineers could use the name of the global variable in their config file instead.

This statement is just a dirty way of querying True, which is what you already tried to do. ilevel_0 is a more formal way of making sure you are referring to the index. See the docs here for more details on using in and ilevel_0: https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method
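A minimal sketch of the idea, assuming the DataFrame has a default, unnamed index (ilevel_0 is the name the query method gives to the first level of an unnamed index). The constant name `ALL_ROWS` and the column `x` are made up for this example.

```python
import pandas as pd

# Hypothetical constant: an expression that is true for every row,
# because each index value is trivially contained in the index itself.
ALL_ROWS = "ilevel_0 in ilevel_0"

# Default RangeIndex, which is unnamed, so ilevel_0 refers to it.
df = pd.DataFrame({"x": [10, 20, 30]})

full = df.query(ALL_ROWS)
```

Because the result comes from an ordinary query call, the same code path serves both the select-everything default and any user-written expression.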

answered Oct 20 '22 00:10 by jorijnsmit