The pandas.DataFrame.query() method is very useful for pre- or post-filtering data when loading or plotting. It comes in particularly handy for method chaining.
I often find myself wanting to apply the same logic to a pandas.Series, e.g. after calling a method such as df.value_counts, which returns a pandas.Series.
Let's assume there is a huge table with the columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. I first have to sum the points of each player (groupby -> agg), which will return a Series of ~1000 players and their overall points. Applying the .query logic, it would look something like this:
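    import random
    import pandas as pd

    df = pd.DataFrame({
        'Points': [random.choice([1, 3]) for x in range(100)],
        'Player': [random.choice(["A", "B", "C"]) for x in range(100)]})

    # Wished-for syntax: Series has no .query() method, so this does not run
    (df
        .query("Points == 3")
        .Player.value_counts()
        .query("> 14")
        .hist())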
The only solutions I have found force me to make an unnecessary assignment and break the method chain:
    points_series = (df
        .query("Points == 3")
        .groupby("Player").size())
    points_series[points_series > 100].hist()
Method chaining and the query method both help keep the code legible, whereas filtering by boolean subsetting can get messy quite quickly.
    # just to make my point :)
    series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape
Please help me out of my dilemma! Thanks
A Series holds a single one-dimensional list of values with an index, whereas a DataFrame can be made up of more than one Series; in other words, a DataFrame is a collection of Series that can be used to analyse data.
The iloc attribute enables purely integer-location based indexing, i.e. selection by position, over a given Series object.
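As a minimal sketch of positional indexing with iloc, assuming a small made-up Series of per-player totals (the data and variable names are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical data: total points per player, indexed by player name.
points = pd.Series([108, 90, 42], index=["a", "s", "t"], name="Points")

first = points.iloc[0]        # value at position 0, whatever its label is
last_two = points.iloc[-2:]   # positional slice, works like a list slice
by_label = points.loc["s"]    # label-based lookup, shown for comparison
```

Note that iloc ignores the labels entirely, while loc uses them; the two only coincide when the index happens to be the default 0..n-1 range.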
A pandas Series is a one-dimensional labelled data structure which can hold data such as strings, integers and even other Python objects. It is built on top of the NumPy array and is the primary data structure for holding one-dimensional data in pandas.
The essential difference is the presence of the index: while a NumPy array has an implicitly defined integer index used to access the values, a pandas Series has an explicitly defined index associated with the values.
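The difference between an implicit and an explicit index can be sketched as follows. The values and labels are made up for illustration; the last step shows one consequence of the explicit index, namely that arithmetic between two Series aligns on labels rather than on position.

```python
import numpy as np
import pandas as pd

values = np.array([50, 20, 38])
series = pd.Series(values, index=["a", "b", "c"])

from_array = values[1]         # NumPy: access by implicit integer position
from_label = series["b"]       # pandas: access by explicit label
from_position = series.iloc[1] # positional access is still available

# Arithmetic aligns on the labels, not on the order of the values.
other = pd.Series([1, 2, 3], index=["c", "b", "a"])
total = series + other
```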
If I understand correctly, you can add query("Points > 100"):
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'Points': [50, 20, 38, 90, 0, np.inf],
                       'Player': ['a', 'a', 'a', 's', 's', 's']})
    print(df)
      Player     Points
    0      a  50.000000
    1      a  20.000000
    2      a  38.000000
    3      s  90.000000
    4      s   0.000000
    5      s        inf

    # Original approach: intermediate assignment, then a boolean mask
    points_series = (df.query("Points < inf")
                       .groupby("Player")
                       .agg({"Points": "sum"})['Points'])
    print(points_series)
    a = points_series[points_series > 100]
    print(a)
    Player
    a    108.0
    Name: Points, dtype: float64

    # Chained: keep the aggregated result as a DataFrame and query it again
    points_series = (df.query("Points < inf")
                       .groupby("Player")
                       .agg({"Points": "sum"})
                       .query("Points > 100"))
    print(points_series)
            Points
    Player
    a        108.0
Another solution is Selection By Callable:
    points_series = (df.query("Points < inf")
                       .groupby("Player")
                       .agg({"Points": "sum"})['Points']
                       .loc[lambda x: x > 100])
    print(points_series)
    Player
    a    108.0
    Name: Points, dtype: float64
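To make the equivalence explicit, here is a small self-contained sketch, on made-up data, comparing selection by callable with the boolean-mask version that needs an intermediate variable. The names totals, chained and masked are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Points": [50, 20, 38, 90, 0],
    "Player": ["a", "a", "a", "s", "s"],
})

# Mask version: needs a name for the intermediate Series.
totals = df.groupby("Player")["Points"].sum()
masked = totals[totals > 100]

# Callable version: .loc passes the Series it is called on to the lambda,
# so the filter can stay inside the chain.
chained = (
    df.groupby("Player")["Points"].sum()
      .loc[lambda s: s > 100]
)
```

Both should give the same Series; the callable form simply avoids having to refer to the intermediate result by name.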
Edit, following the updated question:
    np.random.seed(1234)
    df = pd.DataFrame({
        'Points': [np.random.choice([1, 3]) for x in range(100)],
        'Player': [np.random.choice(["A", "B", "C"]) for x in range(100)]})

    print(df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
    C    19
    B    16
    Name: Player, dtype: int64

    print(df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
    Player
    B    16
    C    19
    dtype: int64
Why not convert from Series to DataFrame, do the querying, and then convert back?
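    # New name on purpose: assigning back to df["Points"] would realign on the index and leave NaN in the filtered-out rows
    points_over_100 = df["Points"].to_frame().query('Points > 100')["Points"]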
Here, .to_frame() converts to a DataFrame, while the trailing ["Points"] converts back to a Series.
The method .query() can then be used consistently whether the pandas object has one or more columns.
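Applied to the kind of chain from the question, the round trip might look like the sketch below. The data is made up, and the column name "n" passed to to_frame is an arbitrary choice for illustration; it is used here because the Series returned by groupby(...).size() has no name of its own to query on.

```python
import pandas as pd

df = pd.DataFrame({
    "Points": [3, 3, 1, 3, 3, 1, 3],
    "Player": ["A", "A", "A", "B", "B", "C", "C"],
})

counts = (
    df.query("Points == 3")
      .groupby("Player").size()   # Series: number of 3-point rows per player
      .to_frame("n")              # Series -> one-column DataFrame named "n"
      .query("n > 1")["n"]        # filter with query, then back to a Series
)
```

The result is a Series again, so a call such as .hist() could follow in the same chain. Compared with .loc[lambda ...], this costs two extra conversions but keeps the filter in query syntax.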