
Pandas Series value_counts working differently for different counts

For example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.repeat(np.arange(1,7),3), columns=['A'])

df1.A.value_counts(sort=False)
1    3
2    3
3    3
4    3
5    3
6    3
Name: A, dtype: int64

df2 = pd.DataFrame(np.repeat(np.arange(1,7),100), columns=['A'])

df2.A.value_counts(sort=False)
1    100
2    100
3    100
4    100
5    100
6    100
Name: A, dtype: int64

In the examples above, value_counts works as expected and gives the required result. With a larger DataFrame, however, the output is different. Here the values in A are already sorted and the counts are all equal, but the order of the index (the A values) changes after value_counts. Why does it keep the order for small counts but not for large ones?

df3 = pd.DataFrame(np.repeat(np.arange(1,7),1000), columns=['A'])

df3.A.value_counts(sort=False)
4    1000
1    1000
5    1000
2    1000
6    1000
3    1000
Name: A, dtype: int64

I can work around this with df3.A.value_counts(sort=False).sort_index() or df3.A.value_counts(sort=False).reindex(df3.A.unique()), but I want to know why it behaves differently for different counts.
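For reference, here is a minimal sketch of the two workarounds mentioned above. The Series `s` and its values are illustrative assumptions (deliberately unsorted so the two results differ); they are not the data from the question.

```python
import numpy as np
import pandas as pd

# Illustrative data (an assumption, not from the question): unsorted values
s = pd.Series(np.repeat([3, 1, 2], 1000), name="A")

# With sort=False the order of the result is not something to rely on
counts = s.value_counts(sort=False)

# Workaround 1: order the result by the values themselves
by_value = counts.sort_index()

# Workaround 2: order the result by first appearance in the data
by_appearance = counts.reindex(s.unique())

print(by_value)
print(by_appearance)
```

Both results are explicit about the order they want, so they should not depend on how value_counts happens to arrange its output.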

Using:

NumPy version: 1.15.2
Pandas version: 0.23.4
Space Impact asked Nov 14 '18 06:11




1 Answer

This is actually a known problem.

If you browse through the source code:

  1. value_counts is implemented in C:\ProgramData\Anaconda3\Lib\site-packages\pandas\core\algorithms.py (line 581)
  2. It calls _value_counts_arraylike for int64 values when bins=None
  3. That function in turn calls keys, counts = htable.value_count_int64(values, dropna)

If you then look at the htable implementation you will conclude that the keys are in an arbitrary order, subject to how the hashtable works.

sort=False is not a guarantee of ANY kind of ordering. Typically (with the default sort=True) this routine sorts by count, largest first, and that is almost always what you want.

I guess they could change this so that sort=False means original ordering. I don't know if this would actually break anything (and done internally it wouldn't be very costly, as the uniques are already known).
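To illustrate what "original ordering" would mean, here is a plain-Python sketch that counts values in order of first appearance. It is not pandas code; the function name `counts_in_first_appearance_order` is made up for this example, and it relies on dicts (and therefore `Counter`) keeping insertion order in Python 3.7+.

```python
from collections import Counter


def counts_in_first_appearance_order(values):
    """Return (value, count) pairs ordered by where each value first appears."""
    # Counter is a dict subclass, so its keys keep insertion order (Python 3.7+)
    return list(Counter(values).items())


print(counts_in_first_appearance_order([3, 3, 1, 2, 1, 3]))
```

The counting is a single pass over the data, which is in line with the remark that keeping the original order would not be very costly.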

The order changes inside build_count_table_object() in pandas/hashtable.pyx: resizing the pymap re-places the entries according to their hash values.
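A toy simulation can make the size dependence concrete. Everything in it is an assumption rather than pandas code: it assumes a khash-style 64-bit integer hash `(key >> 33) ^ key ^ (key << 11)` truncated to 32 bits, a power-of-two bucket count sized from the number of input values, and linear probing on collisions (a simplification). The function names are hypothetical.

```python
def assumed_int64_slot(key, n_buckets):
    """Slot of a non-negative int key under the assumed khash-style hash."""
    h = ((key >> 33) ^ key ^ (key << 11)) & 0xFFFFFFFF  # truncate to 32 bits
    return h & (n_buckets - 1)  # valid because n_buckets is a power of two


def next_power_of_two(n):
    """Smallest power of two that is >= n, with a minimum of 4."""
    size = 4
    while size < n:
        size *= 2
    return size


def simulated_key_order(values):
    """Keys in the order a slot-by-slot walk over the table would yield them."""
    n_buckets = next_power_of_two(len(values))
    slots = {}
    for v in values:
        slot = assumed_int64_slot(v, n_buckets)
        # Linear probing on collision: a simplification for this sketch
        while slot in slots and slots[slot] != v:
            slot = (slot + 1) & (n_buckets - 1)
        slots[slot] = v
    return [slots[s] for s in sorted(slots)]


for repeats in (3, 100, 1000):
    values = [k for k in range(1, 7) for _ in range(repeats)]
    print(repeats, simulated_key_order(values))
```

Under these assumptions, small keys land in the slot equal to their own value while the table has at most 2048 buckets, so the keys come out in order. Once the table grows beyond that, the `key << 11` term starts to affect the slot, and for 6000 input values the simulated order is 4, 1, 5, 2, 6, 3, the same order shown in the question.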

Here is the full discussion

Vivek Kalyanarangan answered Oct 26 '22 17:10
