I have a pandas.DataFrame containing numerous columns. I am interested in just one of those columns ('names'), whose dtype is 'object'. I want to answer three questions about this column:
1. What value(s) appear most often, excluding NaN values?
2. How many values meet that criterion (the number of values in answer #1)?
3. How often do those values appear?
I started with a large dataframe (df). First, I used collections.Counter to get the number of occurrences of each unique value in the 'names' column:
In [52]: cntr = collections.Counter(df['names'].dropna())
Out[52]: Counter({'Erk': 118,
                  'James': 120,
                  'John': 126,
                  'Michael': 122,
                  'Phil': 117,
                  'Ryan': 126})
Then I converted the Counter back to a dataframe:
In [53]: df1 = pd.DataFrame.from_dict(cntr, orient='index').reset_index()
In [54]: df1 = df1.rename(columns={'index':'names', 0:'cnt'})
This gave me a pandas dataframe containing:
In [55]: print (type(df1), df1)
Out[55]: <class 'pandas.core.frame.DataFrame'>
     names  cnt
0      Erk  118
1    James  120
2     Phil  117
3     John  126
4  Michael  122
5     Ryan  126
The next part is where I need a bit of help. My desired output in this example is:
Answer #1 = [John, Ryan]
Answer #2 = 2
Answer #3 = 126
I am not convinced that using the Counter was the best option, so I am open to approaches that stay within the dataframe instead of bouncing from dataframe to Counter and back to dataframe.
You can get that information directly from the Counter, like:
from collections import Counter

data = Counter({'Erk': 118, 'James': 120, 'John': 126,
                'Michael': 122, 'Phil': 117, 'Ryan': 126})

# Group the names by their count: {count: [names with that count]}
by_count = {}
for k, v in data.items():
    by_count.setdefault(v, []).append(k)

# The largest key is the highest count; its list holds the tied names
max_value = max(by_count.keys())
print(by_count[max_value], len(by_count[max_value]), max_value)
['John', 'Ryan'] 2 126
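If the intermediate dictionary feels like more than you need, the same three answers can be read off the Counter with max() and a comprehension. This is only a sketch of an alternative, using the counts from the question; the variable names max_count and top_names are mine, and the result is sorted so the order does not depend on insertion order.

```python
from collections import Counter

data = Counter({'Erk': 118, 'James': 120, 'John': 126,
                'Michael': 122, 'Phil': 117, 'Ryan': 126})

# Highest count across all names (answer #3).
max_count = max(data.values())

# Every name that reaches that count, sorted for a stable order (answer #1).
top_names = sorted(name for name, count in data.items() if count == max_count)

# Answer #2 is simply the length of that list.
print(top_names, len(top_names), max_count)
```

Note that Counter.most_common(1) would return only one of the tied names, which is why the comparison against the maximum is used here.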
There is a helper method that does just what you want: value_counts(). It is efficient even for large dataframes.
df1 = df['names'].value_counts()
# question 3
q3 = df1.max()
# question 1
q1 = df1.loc[df1 == q3].index.tolist()
# question 2
q2 = len(q1)
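To try this without the original data, here is a small self-contained sketch. The dataframe is made up: it repeats each name according to the counts in the question and appends a few NaN rows (an assumption, just to show they are ignored, since value_counts() drops NaN by default). The name vc is mine.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data built from the counts shown in the question,
# plus five NaN rows to show that missing values are not counted.
counts = {'Erk': 118, 'James': 120, 'John': 126,
          'Michael': 122, 'Phil': 117, 'Ryan': 126}
names = [name for name, n in counts.items() for _ in range(n)]
df = pd.DataFrame({'names': names + [np.nan] * 5})

vc = df['names'].value_counts()       # dropna=True by default
q3 = int(vc.max())                    # question 3: the highest count
q1 = sorted(vc[vc == q3].index)       # question 1: names tied at that count
q2 = len(q1)                          # question 2: how many names are tied

print(q1, q2, q3)
```

Sorting q1 is optional; it only makes the order of tied names predictable.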
Since you mention mode, here is an approach using scipy.stats.mode on the cnt column of your df1:
from scipy import stats
Val,cnt=stats.mode(df1.cnt)
Val
Out[349]: array([126], dtype=int64)
cnt
Out[350]: array([2])
df1.names[df1.cnt.isin(Val)].tolist()
Out[358]: ['John', 'Ryan']
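If you would rather stay in pandas, Series.mode() applied to the raw 'names' column may also be worth a look: it returns every value tied for most frequent and skips NaN by default. The sketch below uses a tiny made-up column rather than the real data, so the counts differ from the question; the names modes, q1, q2 and q3 are mine.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature column: John and Ryan appear twice, Erk once.
# NaN appears three times but is ignored, since mode() defaults to dropna=True.
df = pd.DataFrame({'names': ['John', 'Ryan', 'John', 'Erk', 'Ryan',
                             np.nan, np.nan, np.nan]})

modes = df['names'].mode()                        # all tied values, sorted
q1 = modes.tolist()                               # question 1
q2 = len(q1)                                      # question 2
q3 = int((df['names'] == modes.iloc[0]).sum())    # question 3: count of one mode

print(q1, q2, q3)
```

Because all modes share the same count by definition, counting the first one is enough for question 3.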