I've got a data frame and want to filter or bin by a range of values and then get the counts of values in each bin.
Currently, I'm doing this:
x = 5
y = 17
z = 33
filter_values = [x, y, z]
filtered_a = df[df.filtercol <= x]
a_count = filtered_a.filtercol.count()
filtered_b = df[df.filtercol > x]
filtered_b = filtered_b[filtered_b.filtercol <= y]
b_count = filtered_b.filtercol.count()
filtered_c = df[df.filtercol > y]
c_count = filtered_c.filtercol.count()
But is there a more concise way to accomplish the same thing?
Perhaps you are looking for pandas.cut:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(50), columns=['filtercol'])
filter_values = [0, 5, 17, 33]
out = pd.cut(df.filtercol, bins=filter_values)
counts = out.value_counts()  # pd.value_counts(out) is deprecated in modern pandas
# counts is a Series
print(counts)
yields
(17, 33] 16
(5, 17] 12
(0, 5] 5
To reorder the result so the bin ranges appear in order, you could use
counts.sort_index()
which yields
(0, 5] 5
(5, 17] 12
(17, 33] 16
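One caveat: by default the intervals are open on the left, so with bins=[0, 5, 17, 33] the value 0 itself (and anything above 33) is not counted and becomes NaN. If you want the lowest edge included, a minimal sketch using cut's include_lowest parameter:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(50), columns=['filtercol'])

# include_lowest=True closes the first interval on the left,
# so the value 0 lands in the first bin instead of becoming NaN
out = pd.cut(df.filtercol, bins=[0, 5, 17, 33], include_lowest=True)
counts = out.value_counts().sort_index()
print(counts)
```

The first bin is then labelled (-0.001, 5.0] and counts 6 values (0 through 5); values above 33 are still dropped.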
Thanks to nivniv and InLaw for this improvement.
See also Discretization and quantiling.
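For the quantiling side of that, pandas also provides qcut, which picks the bin edges from quantiles of the data so each bin holds a (near-)equal number of rows. A short sketch on the same example frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(50), columns=['filtercol'])

# q=4 splits the data at its quartiles into four roughly equal-sized bins
out = pd.qcut(df.filtercol, q=4)
counts = out.value_counts().sort_index()
print(counts)
```

Unlike cut with explicit edges, every value falls into some bin here, so the counts always sum to the length of the column.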