Pandas Series value_counts working differently for different counts

Tags: python, pandas, numpy

For example:

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.repeat(np.arange(1,7),3), columns=['A'])

df1.A.value_counts(sort=False)
1    3
2    3
3    3
4    3
5    3
6    3
Name: A, dtype: int64

df2 = pd.DataFrame(np.repeat(np.arange(1,7),100), columns=['A'])

df2.A.value_counts(sort=False)
1    100
2    100
3    100
4    100
5    100
6    100
Name: A, dtype: int64

In the examples above, value_counts works as expected and gives the required result, but for a larger DataFrame the output is different. Here the values of A are already sorted and the counts are all equal, yet the order of the index (the values of A) changes after value_counts. Why does it work correctly for small counts but not for large counts?

df3 = pd.DataFrame(np.repeat(np.arange(1,7),1000), columns=['A'])

df3.A.value_counts(sort=False)
4    1000
1    1000
5    1000
2    1000
6    1000
3    1000
Name: A, dtype: int64

As a workaround I can use df3.A.value_counts(sort=False).sort_index() or df3.A.value_counts(sort=False).reindex(df3.A.unique()), but I want to know why it behaves differently for different counts.

Using:

NumPy version: 1.15.2
Pandas version: 0.23.4

asked Nov 14 '18 06:11

Space Impact

1 Answer

This is actually a known problem.

If you browse through the source code:

1. C:\ProgramData\Anaconda3\Lib\site-packages\pandas\core\algorithms.py, line 581, is the original implementation.
2. It calls _value_counts_arraylike for int64 values when bins=None.
3. That function calls keys, counts = htable.value_count_int64(values, dropna).

If you then look at the htable implementation, you will see that the keys come back in an arbitrary order that depends on how the hash table works.

There is no guarantee of ANY kind of ordering. By default (sort=True) this routine sorts by count, largest first, and that is almost always what you want.

I guess they could change this so that sort=False means original ordering. I don't know whether that would actually break anything (done internally, it wouldn't be very costly, since the uniques are already known).
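In the meantime, the order can be fixed explicitly after counting. The following is only a minimal sketch using the same data as df3 in the question; the helper name counts_in_order_of_appearance is hypothetical and not part of pandas. It relies on Series.unique() returning values in order of first appearance.

```python
import numpy as np
import pandas as pd


def counts_in_order_of_appearance(series: pd.Series) -> pd.Series:
    """Count values, then order the result by first appearance in `series`.

    Hypothetical helper for illustration; not a pandas API.
    """
    counts = series.value_counts(sort=False)
    # unique() keeps order of first appearance; dropna() matches the
    # default dropna=True behaviour of value_counts.
    return counts.reindex(series.dropna().unique())


df3 = pd.DataFrame(np.repeat(np.arange(1, 7), 1000), columns=["A"])

# Option 1: order of first appearance in the column.
by_appearance = counts_in_order_of_appearance(df3["A"])

# Option 2: order by the index values themselves.
by_value = df3["A"].value_counts(sort=False).sort_index()

print(by_appearance)
print(by_value)
```

For this data both options give the same order, because the values already appear in ascending order. They differ when the column is not sorted: sort_index() orders by value, while the reindex variant follows the data.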

The order is changed in build_count_table_object() in pandas/hashtable.pyx: resizing the pymap moves the entries according to their hash values.
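To illustrate why the table size matters, here is a toy model in pure Python. It is not the pandas implementation: the function name count_with_toy_table, the hash (v * 11) % capacity, and the linear probing are all assumptions chosen for the example. It only shows that the same keys come out in a different order when the table has a different capacity.

```python
def count_with_toy_table(values, capacity):
    """Count values in a toy open-addressing hash table.

    Returns (key, count) pairs in slot order, i.e. the order in which
    a scan over the table would encounter the keys.
    Assumption: capacity is larger than the number of distinct values.
    """
    keys = [None] * capacity
    counts = [0] * capacity
    for v in values:
        i = (v * 11) % capacity  # arbitrary toy hash, NOT the one pandas uses
        while keys[i] is not None and keys[i] != v:
            i = (i + 1) % capacity  # linear probing on collision
        keys[i] = v
        counts[i] += 1
    return [(k, c) for k, c in zip(keys, counts) if k is not None]


# Same shape as the question's data: values 1..6, each repeated 3 times.
values = [v for v in range(1, 7) for _ in range(3)]

small = count_with_toy_table(values, capacity=8)
large = count_with_toy_table(values, capacity=16)

print([k for k, _ in small])
print([k for k, _ in large])
```

The counts are identical in both runs; only the order of the keys differs, because each key's slot depends on the capacity. A resize of a real hash table has the same kind of effect.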

Here is the full discussion

answered Oct 26 '22 17:10

Vivek Kalyanarangan
