I'm starting to learn Pandas and am trying to find the most Pythonic (or panda-thonic?) ways to do certain tasks.
Suppose we have a DataFrame with columns A, B, and C.
What we want to discover is the subtle distinctions between B values for rows that have A set to false, vs. B values for rows that have A is true.
In other words, how can I group by the value of column A (either true or false), then plot the values of column B for both groups on the same graph? The two datasets should be colored differently to be able to distinguish the points.
Next, let's add another feature to this program: before graphing, we want to compute another value for each row and store it in column D. This value is the mean of all data stored in B for the entire five minutes before a record - but we only include rows that have the same boolean value stored in A.
In other words, if I have a row where A=True
and time=t
, I want to compute a value for column D that is the mean of B for all records from time t-5
to t
that have the same A=True
.
In this case, how can we execute the groupby on values of A, then apply this computation to each individual group, and finally plot the D values for the two groups?
Group DataFrame using a mapper or by a Series of columns. A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
1) Using pandas groupby size() method The most simple method for pandas groupby count is by using the in-built pandas method named size(). It returns a pandas series that possess the total number of row count for each group.
So a groupby() operation can downcast to a Series, or if given a Series as input, can upcast to dataframe. For your first dataframe, you run unequal groupings (or unequal index lengths) coercing a series return which in the "combine" processing does not adequately yield a data frame.
I think @herrfz hit all the high points. I'll just flesh out the details:
import pandas as pd import numpy as np import matplotlib.pyplot as plt sin = np.sin cos = np.cos pi = np.pi N = 100 x = np.linspace(0, pi, N) a = sin(x) b = cos(x) df = pd.DataFrame({ 'A': [True]*N + [False]*N, 'B': np.hstack((a,b)) }) for key, grp in df.groupby(['A']): plt.plot(grp['B'], label=key) grp['D'] = pd.rolling_mean(grp['B'], window=5) plt.plot(grp['D'], label='rolling ({k})'.format(k=key)) plt.legend(loc='best') plt.show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With