I have discovered the pandas DataFrame.query method and it almost does exactly what I needed it to (and implemented my own parser for, since I hadn't realized it existed but really I should be using the standard method).
I would like my users to be able to specify the query in a configuration file. The syntax seems intuitive enough that I can expect my non-programmer (but engineer) users to figure it out.
There's just one thing missing: a way to select everything in the dataframe. Sometimes what my users want to use is every row, so they would put 'All' or something into that configuration option. In fact, that will be the default option.
I tried df.query('True') but that raised a KeyError. I tried df.query('1') but that returned the row with index 1. The empty string raised a ValueError.
The only things I can think of are 1) put an if clause every time I need to do this type of query (probably 3 or 4 times in the code) or 2) subclass DataFrame and either reimplement query, or add a query_with_all method:
import pandas as pd
class MyDataFrame(pd.DataFrame):
def query_with_all(self, query_string):
if query_string.lower() == 'all':
return self
else:
return self.query(query_string)
And then use my own class every time instead of the pandas one. Is this the only way to do this?
A function set_option() is provided by pandas to display all rows of the data frame. display. max_rows represents the maximum number of rows that pandas will display while displaying a data frame. The default value of max_rows is 10.
The query() method allows you to query the DataFrame. The query() method takes a query expression as a string parameter, which has to evaluate to either True of False. It returns the DataFrame where the result is True according to the query expression.
The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
iloc returns a Pandas Series when one row is selected, and a Pandas DataFrame when multiple rows are selected, or if any column in full is selected.
Keep things simple, and use a function:
def query_with_all(data_frame, query_string):
if query_string == "all":
return data_frame
return data_frame.query(query_string)
Whenever you need to use this type of query, just call the function with the data frame and the query string. There's no need to use any extra if
statements or subclass pd.Dataframe
.
If you're restricted to using df.query
, you can use a global variable
ALL = slice(None)
df.query('@ALL', engine='python')
If you're not allowed to use global variables, and if your DataFrame isn't MultiIndexed, you can use
df.query('tuple()')
All of these will property handle NaN
values.
df.query('ilevel_0 in ilevel_0')
will always return the full dataframe, also when the index contains NaN
values or even when the dataframe is completely empty.
In you particular case you could then define a global variable all_true = 'ilevel_0 in ilevel_0'
(as suggested in the comments by Zero) so that your engineers could use the name of the global variable in their config file instead.
This statement is just a dirty way to properly query True
like you already tried. ilevel_0
is a more formal way of making sure you are referring the index. See the docs here for more details on using in
and ilevel_0
: https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-query-method
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With