
pandas Series.value_counts returns inconsistent order for equal count strings

When I run the code below:

import pandas

s = pandas.Series(['c', 'a', 'b', 'a', 'b'])
print(s.value_counts())

Sometimes I get this:

a    2
b    2
c    1
dtype: int64

And sometimes I get this:

b    2
a    2
c    1
dtype: int64

That is, the index order returned for items with equal counts differs from run to run. I couldn't reproduce this when the Series values are integers instead of strings.

Why does this happen, and what is the most efficient way to get the same index order every time?

I want it to still be sorted in descending order by counts, but to be consistent in the order of equivalent-counts items.

I'm running Python 3.7.0 and pandas 0.23.4

Karmen asked Aug 20 '18 15:08


2 Answers

You have a few options to sort consistently given a series:

import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c', 'c'])
c = s.value_counts()

sort by index

Use pd.Series.sort_index:

res = c.sort_index()

a    2
b    1
c    2
dtype: int64

sort by count (arbitrary for ties)

For descending counts, do nothing, as this is the default for value_counts. For ascending counts, you can use pd.Series.sort_values, which defaults to ascending=True. In either case, you should make no assumptions about how ties are ordered.

res = c.sort_values()

b    1
c    2
a    2
dtype: int64

More efficiently, since c is already sorted by descending count, you can use c.iloc[::-1] to reverse it into ascending order without sorting again.
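As a rough illustration of the reversal, here is a minimal sketch. The helper name ascending_counts is hypothetical and not part of pandas. The order of tied items is still whatever value_counts happened to produce:

```python
import pandas as pd


def ascending_counts(series):
    """Return value counts in ascending order by reversing the default result."""
    # value_counts() sorts by descending count by default,
    # so reversing it gives ascending counts without a second sort.
    return series.value_counts().iloc[::-1]


s = pd.Series(['a', 'b', 'a', 'c', 'c'])
res = ascending_counts(s)
print(res)
```

Only the position of 'b' (the single lowest count) is fixed here; 'a' and 'c' may appear in either order.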

sort by count and then by index

You can use numpy.lexsort to sort by count and then by index. Note that lexsort treats the last key in the tuple as the primary sort key, so -c.values (descending count) is applied first and c.index only breaks ties.

res = c.iloc[np.lexsort((c.index, -c.values))]

a    2
c    2
b    1
dtype: int64
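
The same idea can be wrapped in a small helper and applied to the data from the question. This is only a sketch: the function name counts_with_sorted_ties is hypothetical, and it assumes the labels are mutually comparable (for example, all strings).

```python
import numpy as np
import pandas as pd


def counts_with_sorted_ties(series):
    """Value counts sorted by descending count, ties broken by label."""
    c = series.value_counts()
    labels = np.asarray(c.index, dtype=object)
    counts = np.asarray(c.values)
    # np.lexsort sorts by the LAST key first: -counts is the primary key
    # (descending count) and labels is the tie-breaker (ascending).
    order = np.lexsort((labels, -counts))
    return c.iloc[order]


s = pd.Series(['c', 'a', 'b', 'a', 'b'])
res = counts_with_sorted_ties(s)
print(res)
```

Because the tie-breaker is the label itself, the result no longer depends on the order in which value_counts emitted the tied items.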
jpp answered Oct 19 '22 23:10


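Add a reindex after value_counts (here df is a Series). This returns the items in order of first appearance, which is consistent, but the result is no longer sorted by count: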

df.value_counts().reindex(df.unique())
Out[353]: 
a    1
b    1
dtype: int64

Update

s.value_counts().sort_index().sort_values()
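
Note that sort_values() with no arguments sorts ascending, and its default sort algorithm is not guaranteed to be stable. A possible variant for the descending order asked about in the question is sketched below. The function name stable_value_counts is hypothetical, and the sketch assumes that sort_values keeps tied rows in their input order when kind='mergesort' is requested together with ascending=False:

```python
import pandas as pd


def stable_value_counts(series):
    """Descending counts with ties in ascending label order."""
    # First fix the order of the labels themselves.
    by_label = series.value_counts().sort_index()
    # kind='mergesort' requests a stable sort, so rows with equal counts
    # keep the label order established by sort_index().
    return by_label.sort_values(ascending=False, kind='mergesort')


s = pd.Series(['c', 'a', 'b', 'a', 'b'])
res = stable_value_counts(s)
print(res)
```

This stays within pandas and avoids numpy, at the cost of two sorts instead of one.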
BENY answered Oct 20 '22 00:10