Is there a query method or similar for pandas Series (pandas.Series.query())?

Example

Let's assume there is a huge table with the columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. I first have to sum the points of each player (groupby -> agg), which returns a Series of ~1000 players and their overall points. Applying the .query logic, it would look something like this:

    import random
    import pandas as pd

    df = pd.DataFrame({
        'Points': [random.choice([1, 3]) for x in range(100)],
        'Player': [random.choice(["A", "B", "C"]) for x in range(100)]})

    (df
        .query("Points == 3")
        .Player.value_counts()
        .query("> 14")   # wished-for API: pandas.Series has no .query()
        .hist())

The only solutions I have found force me to make an unnecessary assignment and break the method chain:

    points_series = (df
        .query("Points == 3")
        .groupby("Player").size())
    points_series[points_series > 14].hist()

Method chaining and the query method both help keep the code legible, whereas filtering by subsetting can get messy quite quickly.

    # just to make my point :)
    series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape

Please help me out of my dilemma! Thanks

asked Oct 21 '16 08:10

dmeu

2 Answers

If I understand correctly, you can keep the result as a DataFrame and add query("Points > 100"):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Points': [50, 20, 38, 90, 0, np.inf],
                       'Player': ['a', 'a', 'a', 's', 's', 's']})

    print (df)
      Player     Points
    0      a  50.000000
    1      a  20.000000
    2      a  38.000000
    3      s  90.000000
    4      s   0.000000
    5      s        inf

    points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
    print (points_series)

    a = points_series[points_series > 100]
    print (a)
    Player
    a    108.0
    Name: Points, dtype: float64

    # the outer parentheses let the chain span several lines
    points_series = (df.query("Points < inf")
                       .groupby("Player")
                       .agg({"Points": "sum"})
                       .query("Points > 100"))

    print (points_series)
            Points
    Player
    a        108.0

Another solution is Selection By Callable:

    # the outer parentheses let the chain span several lines
    points_series = (df.query("Points < inf")
                       .groupby("Player")
                       .agg({"Points": "sum"})['Points']
                       .loc[lambda x: x > 100])

    print (points_series)
    Player
    a    108.0
    Name: Points, dtype: float64
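To see the callable form of .loc applied to the question's use case, here is a minimal sketch on a tiny made-up table (the data and the name `frequent` are assumptions for illustration, not from the original post). The lambda receives the Series produced by the previous step, so no intermediate variable is needed:

```python
import pandas as pd

# Small, deterministic stand-in for the Player/Points table (made-up data).
df = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "C"],
    "Points": [3, 3, 1, 3, 3, 3],
})

# Count the 3-point scores per player, then filter the resulting Series
# inside the same chain: .loc calls the lambda with that Series.
frequent = (
    df.query("Points == 3")
      .groupby("Player")
      .size()
      .loc[lambda s: s > 1]
)
print(frequent)
```

Here only players A and B remain, since each has two 3-point scores and C has one. A trailing .hist() or .plot() could be appended to the chain in the same way.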

Answer updated for the edited question:

    np.random.seed(1234)
    df = pd.DataFrame({
        'Points': [np.random.choice([1,3]) for x in range(100)],
        'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})

    print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
    C    19
    B    16
    Name: Player, dtype: int64

    print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
    Player
    B    16
    C    19
    dtype: int64

answered Sep 20 '22 21:09

jezrael

Why not convert from Series to DataFrame, do the querying, and then convert back?

df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"]

Here, .to_frame() converts to DataFrame, while the trailing ["Points"] converts to Series.

The method .query() can then be used consistently, whether the pandas object has one column or several.
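The round trip can also be wrapped so that it fits into a method chain via .pipe. The following is only a sketch: `query_series` is a hypothetical helper (not part of pandas), and the column name "value" is an arbitrary choice used to give the query expression something to refer to. The sample data is made up.

```python
import pandas as pd


def query_series(s, expr, name="value"):
    """Hypothetical helper: filter a Series with a DataFrame.query expression.

    The Series becomes a one-column DataFrame whose column is called `name`,
    so `expr` must refer to that name (e.g. "value > 1").
    """
    return s.to_frame(name).query(expr)[name]


# Small, deterministic stand-in for the Player/Points table (made-up data).
df = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "C"],
    "Points": [3, 3, 1, 3, 3, 3],
})

counts = (
    df.query("Points == 3")
      .groupby("Player")
      .size()
      .pipe(query_series, "value > 1")  # Series -> DataFrame -> query -> Series
)
print(counts)
```

Note that the returned Series carries the temporary column name ("value" here) rather than the original Series name. Compared with .loc[lambda ...], this keeps the string-expression style of .query at the cost of an extra conversion.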

answered Sep 16 '22 21:09

Martin

Tags: python, pandas, dataframe, series, method-chaining