For example:
df1 = pd.DataFrame(np.repeat(np.arange(1,7),3), columns=['A'])
df1.A.value_counts(sort=False)
1 3
2 3
3 3
4 3
5 3
6 3
Name: A, dtype: int64
df2 = pd.DataFrame(np.repeat(np.arange(1,7),100), columns=['A'])
df2.A.value_counts(sort=False)
1 100
2 100
3 100
4 100
5 100
6 100
Name: A, dtype: int64
In the above examples value_counts works perfectly and gives the required result, whereas for larger dataframes it gives a different output. Here the A values are already sorted and the counts are all the same, but the order of the index (the A values) changes after value_counts. Why does it work correctly for small counts but not for large counts:
df3 = pd.DataFrame(np.repeat(np.arange(1,7),1000), columns=['A'])
df3.A.value_counts(sort=False)
4 1000
1 1000
5 1000
2 1000
6 1000
3 1000
Name: A, dtype: int64
Here I can do df3.A.value_counts(sort=False).sort_index() or df3.A.value_counts(sort=False).reindex(df3.A.unique()) to fix the order. What I want to know is the reason why it behaves differently for different counts.
Using:
NumPy version: 1.15.2
Pandas version: 0.23.4
count() should be used when you want to find the number of valid (non-NA) values in a column, whereas value_counts() should be used to find the frequencies of the values in a Series.
From the documentation: "Return a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element." In other words, for any column in a dataframe, value_counts() returns how many times each unique entry occurs in that column, sorted by frequency by default.
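A minimal sketch of that distinction (the DataFrame here is made up for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, np.nan]})

# count(): number of non-NA values, per column
df.count()
# A    3
# dtype: int64

# value_counts(): frequency of each distinct value in a Series (NaN dropped by default)
df['A'].value_counts()
# 1.0    2
# 2.0    1
# Name: A, dtype: int64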
This is actually a known problem.
If you browse through the source code (e.g. C:\ProgramData\Anaconda3\Lib\site-packages\pandas\core\algorithms.py, around line 581), the original implementation of _value_counts_arraylike for int64 values when bins=None is:
keys, counts = htable.value_count_int64(values, dropna)
If you then look at the htable implementation, you will conclude that the keys are in an arbitrary order, subject to how the hash table works.
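A quick sketch to observe this yourself (whether and where the order diverges depends on the pandas version; the question reports pandas 0.23.4):
import numpy as np
import pandas as pd

for n in (3, 100, 1000):
    s = pd.Series(np.repeat(np.arange(1, 7), n))
    counts = s.value_counts(sort=False)
    # True while the hashtable happens to return keys in insertion order
    print(n, counts.index.is_monotonic_increasing)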
It's not a guarantee of ANY kind of ordering. Typically this routine sorts by biggest values, and that is almost always what you want.
I guess they could change this so that sort=False means original ordering. I don't know whether this would actually break anything (and, done internally, this isn't very costly, since the uniques are already known).
The order is changed in pandas/hashtable.pyx's build_count_table_object(). Resizing of the pymap moves the entries around according to their hash values.
Here is the full discussion.
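Until then, a sketch of the two workarounds already mentioned in the question, which make the order deterministic:
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.repeat(np.arange(1, 7), 1000), columns=['A'])

# Option 1: sort the result by its index values
df3.A.value_counts(sort=False).sort_index()

# Option 2: reorder by first appearance in the original column
df3.A.value_counts(sort=False).reindex(df3.A.unique())

Both return the counts with the index in the order 1 through 6.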