When I run the code below:
import pandas
s = pandas.Series(['c', 'a', 'b', 'a', 'b'])
print(s.value_counts())
Sometimes I get this:
a 2
b 2
c 1
dtype: int64
And sometimes I get this:
b 2
a 2
c 1
dtype: int64
That is, the index order returned for items with equal counts is not always the same. I couldn't reproduce this when the Series values are integers instead of strings.
Why does this happen, and what is the most efficient way to get the same index order every time?
I still want the result sorted in descending order by count, but with a consistent order among items that have equal counts.
I'm running Python 3.7.0 and pandas 0.23.4
You have a few options to sort consistently given a series:
import numpy as np
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c', 'c'])
c = s.value_counts()
Use pd.Series.sort_index:
res = c.sort_index()
a 2
b 1
c 2
dtype: int64
For descending counts, do nothing, as this is the default. Otherwise, you can use pd.Series.sort_values, which defaults to ascending=True. In either case, you should make no assumptions about how ties are handled.
res = c.sort_values()
b 1
c 2
a 2
dtype: int64
Since c is already sorted by descending count, it is more efficient to reverse it with c.iloc[::-1] than to sort again.
You can use numpy.lexsort to sort by count and then by index. Note the reversed key order: np.lexsort treats the last key as the primary one, so -c.values (descending counts) is the primary sort key and c.index breaks ties.
res = c.iloc[np.lexsort((c.index, -c.values))]
a 2
c 2
b 1
dtype: int64
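The lexsort approach can be wrapped in a small helper. This is only a minimal sketch: the function name value_counts_deterministic is hypothetical, and it assumes the labels are mutually comparable (for example, all strings).

```python
import numpy as np
import pandas as pd


def value_counts_deterministic(s):
    """Counts in descending order; ties broken by ascending label."""
    c = s.value_counts()
    labels = np.asarray(c.index, dtype=object)
    counts = np.asarray(c, dtype=np.int64)
    # np.lexsort uses the LAST key as the primary sort key,
    # so -counts orders by descending count and labels break ties.
    order = np.lexsort((labels, -counts))
    return c.iloc[order]


s = pd.Series(['c', 'a', 'b', 'a', 'b'])
res = value_counts_deterministic(s)
print(res)

# The same data in a different input order gives the same result.
res_shuffled = value_counts_deterministic(pd.Series(['b', 'a', 'c', 'b', 'a']))
```

Because both sort keys are given explicitly, the result does not depend on the order in which value_counts happens to return tied items.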
Add a reindex after value_counts (df here is a Series). This orders the result by first appearance of each value:
df.value_counts().reindex(df.unique())
Out[353]:
a 1
b 1
dtype: int64
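Note that reindexing by unique() alone gives a reproducible order, but it is the order of first appearance, not descending counts. A possible way to combine the two, sketched under the assumption that ties should be broken by first appearance, is to follow the reindex with a stable argsort on the negated counts (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series(['c', 'b', 'a', 'b', 'a'])

# Order of first appearance: c, b, a
first_seen = s.value_counts().reindex(s.unique())

# A stable sort keeps tied items in their current (first-seen) order.
counts = np.asarray(first_seen, dtype=np.int64)
order = np.argsort(-counts, kind='stable')
res = first_seen.iloc[order]
print(res)
```

Here b and a both occur twice, and b stays ahead of a because it appears first in s.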
Update: sort by index first, then by values:
s.value_counts().sort_index().sort_values()
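The chained call above only keeps ties in index order if the second sort is stable, and the default sort kind for sort_values (quicksort) does not promise that. A hedged variant that requests a stable algorithm explicitly, keeping the ascending order of the original line:

```python
import pandas as pd

s = pd.Series(['c', 'a', 'b', 'a', 'b'])

# sort_index fixes the order of the labels; the stable mergesort
# then sorts by count without reordering labels that tie.
res = s.value_counts().sort_index().sort_values(kind='mergesort')
print(res)
```

This gives ascending counts with ties in label order. For descending counts with explicit tie-breaking, the numpy.lexsort approach shown earlier states both keys directly.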