I have a dataframe with two columns, one for names and one for string values. I'm trying to count the frequency of selected string values for each name.
I've tried pandas.pivot_table and pandas.DataFrame.groupby, but I'd like to create a whole new dataframe rather than an aggregation.
For example, I have a dataframe:
import pandas as pd
import numpy as np
data = np.array([['John', 'x'], ['John', 'x'], ['John', 'x'], ['John', 'y'], ['John', 'y'], ['John', 'a'],
['Will', 'x'], ['Will', 'z']])
df = pd.DataFrame(data, columns=['name','str_value'])
df
which results in:
   name str_value
0  John         x
1  John         x
2  John         x
3  John         y
4  John         y
5  John         a
6  Will         x
7  Will         z
An expected result would be:
   name  x  y  z
0  John  3  2  0
1  Will  1  0  1
and additionally:
   name     x      y      z
0  John  True   True  False
1  Will  True  False   True
I'd like to select only x, y and z, and return False where the count is 0 or NaN and True otherwise.
Edit: Thank you for the answers. These work great, but the output has the subgroup "str_value":
str_value     x      y      z
name
John       True   True  False
Will       True  False   True
Is there a way to remove this so I have "name", "x", "y", "z" on the same level? With .reset_index() I get:
str_value  name     x      y      z
0          John  True   True  False
1          Will  True  False   True
Is the name of my index "str_value" now? Can I rename or drop this?
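On the question in the edit: "str_value" is the name of the column axis (df.columns.name), not the name of the row index, which is why it is still printed after .reset_index(). A minimal sketch of one way to clear it follows. The frame is rebuilt by hand here to match the output shown above, and the variable names bools and flat are only illustrative.

```python
import pandas as pd

# Rebuild a frame shaped like the one in the edit: the row index is named
# "name" and the column axis is named "str_value".
bools = pd.DataFrame(
    {"x": [True, True], "y": [True, False], "z": [False, True]},
    index=pd.Index(["John", "Will"], name="name"),
)
bools.columns.name = "str_value"

# reset_index() turns "name" into a regular column;
# rename_axis(columns=None) clears the leftover "str_value" label.
flat = bools.reset_index().rename_axis(columns=None)
print(flat)
```

Setting flat.columns.name = None in place should have the same effect; rename_axis is used here only because it chains with reset_index().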
With a mix of groupby and pivot:
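# Count rows per (name, str_value) pair; "name" stays as the index
total = df.groupby(["name", "str_value"]).size().reset_index(level=1, name="total")
# Spread str_value into columns, fill missing combinations with 0, drop the unwanted "a"
counts = total.pivot(columns="str_value", values="total").fillna(0).drop(columns=["a"])
# True wherever the count is non-zero
bools = counts > 0.0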
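As an alternative sketch (not the approach above, but a swapped-in technique), pd.crosstab can build the count table directly. Selecting the wanted values with reindex avoids hard-coding the value to drop and keeps the counts as integers. The list name wanted is an assumption introduced for illustration, and the dataframe is rebuilt from the question's data so the snippet runs on its own.

```python
import pandas as pd

# Same data as in the question
df = pd.DataFrame(
    {
        "name": ["John"] * 6 + ["Will"] * 2,
        "str_value": ["x", "x", "x", "y", "y", "a", "x", "z"],
    }
)

wanted = ["x", "y", "z"]  # hypothetical name: the values to keep ("a" is left out)

# Cross-tabulate name against str_value, keep only the wanted columns
# (any wanted value that never occurs is filled with 0), and clear the
# "str_value" label from the column axis.
counts = (
    pd.crosstab(df["name"], df["str_value"])
    .reindex(columns=wanted, fill_value=0)
    .rename_axis(columns=None)
)

# True wherever the count is non-zero
bools = counts > 0

# Move "name" out of the index so it sits on the same level as x, y, z
counts = counts.reset_index()
bools = bools.reset_index()

print(counts)
print(bools)
```

Both frames end up with the flat columns name, x, y, z, which matches the layout asked for in the question.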