Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a query method or similar for pandas Series (pandas.Series.query())?

The pandas.DataFrame.query() method is of great usage for (pre/post)-filtering data when loading or plotting. It comes particularly handy for method chaining.

I find myself often wanting to apply the same logic to a pandas.Series, e.g. after having done a method such as df.value_counts which returns a pandas.Series.

Example

Lets assume there is a huge table with the columns Player, Game, Points and I want to plot a histogram of the players with more than 14 times 3 points. I first have to sum the points of each player (groupby -> agg) which will return a Series of ~1000 players and their overall points. Applying the .query logic it would look something like this:

df = pd.DataFrame({     'Points': [random.choice([1,3]) for x in range(100)],      'Player': [random.choice(["A","B","C"]) for x in range(100)]})  (df      .query("Points == 3")      .Player.values_count()      .query("> 14")      .hist()) 

The only solutions I find force me to do an unnecessary assignment and break the method chaining:

(points_series = df      .query("Points == 3")      .groupby("Player").size() points_series[points_series > 100].hist() 

Method chaining as well as the query method help to keep the code legible meanwhile the subsetting-filtering can get messy quite quickly.

# just to make my point :) series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape 

Please help me out of my dilemma! Thanks

like image 523
dmeu Avatar asked Oct 21 '16 08:10

dmeu


People also ask

What is the difference between a Pandas series and a Pandas DataFrame?

Series can only contain single list with index, whereas dataframe can be made of more than one series or we can say that a dataframe is a collection of series that can be used to analyse the data.

Can you use ILOC on a series?

iloc attribute enables purely integer-location based indexing for selection by position over the given Series object. Example #1: Use Series. iloc attribute to perform indexing over the given Series object.

What is series method in pandas?

#pandas series # pandas create series. A pandas Series is a one-dimensional labelled data structure which can hold data such as strings, integers and even other Python objects. It is built on top of numpy array and is the primary data structure to hold one-dimensional data in pandas.

Is a pandas series the same as a Numpy array?

The essential difference is the presence of the index: while the Numpy Array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.


2 Answers

If I understand correctly you can add query("Points > 100"):

df = pd.DataFrame({'Points':[50,20,38,90,0, np.Inf],                    'Player':['a','a','a','s','s','s']})  print (df)   Player     Points 0      a  50.000000 1      a  20.000000 2      a  38.000000 3      s  90.000000 4      s   0.000000 5      s        inf  points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points'] print (points_series)      a = points_series[points_series > 100] print (a)      Player a    108.0 Name: Points, dtype: float64   points_series = df.query("Points < inf")                   .groupby("Player")                   .agg({"Points": "sum"})                   .query("Points > 100")  print (points_series)              Points Player         a        108.0 

Another solution is Selection By Callable:

points_series = df.query("Points < inf")                   .groupby("Player")                   .agg({"Points": "sum"})['Points']                   .loc[lambda x: x > 100]  print (points_series)      Player a    108.0 Name: Points, dtype: float64 

Edited answer by edited question:

np.random.seed(1234) df = pd.DataFrame({     'Points': [np.random.choice([1,3]) for x in range(100)],      'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})  print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15]) C    19 B    16 Name: Player, dtype: int64  print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15]) Player B    16 C    19 dtype: int64 
like image 96
jezrael Avatar answered Sep 20 '22 21:09

jezrael


Why not convert from Series to DataFrame, do the querying, and then convert back.

df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"] 

Here, .to_frame() converts to DataFrame, while the trailing ["Points"] converts to Series.

The method .query() can then be used consistently whether or not the Pandas object has 1 or more columns.

like image 31
Martin Avatar answered Sep 16 '22 21:09

Martin