What is the difference between pandas.qcut and pandas.cut?

Tags:

python

pandas

The documentation says:

http://pandas.pydata.org/pandas-docs/dev/basics.html

"Continuous values can be discretized using the cut (bins based on values) and qcut (bins based on sample quantiles) functions"

Sounds very abstract to me... I can see the differences in the example below but what does qcut (sample quantile) actually do/mean? When would you use qcut versus cut?

Thanks.

    factors = np.random.randn(30)

    In [11]: pd.cut(factors, 5)
    Out[11]:
    [(-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (-0.411, 0.575], (0.575, 1.561], ..., (-0.411, 0.575], (-1.397, -0.411], (0.575, 1.561], (-2.388, -1.397], (-0.411, 0.575]]
    Length: 30
    Categories (5, object): [(-2.388, -1.397] < (-1.397, -0.411] < (-0.411, 0.575] < (0.575, 1.561] < (1.561, 2.547]]

    In [14]: pd.qcut(factors, 5)
    Out[14]:
    [(-0.348, 0.0899], (-0.348, 0.0899], (0.0899, 1.19], (0.0899, 1.19], (0.0899, 1.19], ..., (0.0899, 1.19], (-1.137, -0.348], (1.19, 2.547], [-2.383, -1.137], (-0.348, 0.0899]]
    Length: 30
    Categories (5, object): [[-2.383, -1.137] < (-1.137, -0.348] < (-0.348, 0.0899] < (0.0899, 1.19] < (1.19, 2.547]]

450

asked May 13 '15 10:05

WillZ

1 Answer

To begin, note that "quantile" is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut for quintiles.

So, when you ask for quintiles with qcut, the bins are chosen so that each one holds the same number of records (or as close to that as the data allow). You have 30 records, so you should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):

    pd.qcut(factors, 5).value_counts()

    [-2.578, -0.829]    6
    (-0.829, -0.36]     6
    (-0.36, 0.366]      6
    (0.366, 0.868]      6
    (0.868, 2.617]      6
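
To see where those breakpoints come from, here is a minimal sketch that compares the edges chosen by qcut with the sample quintiles computed directly by np.quantile. The seed and the variable names (rng, binned, edges, quintiles) are illustrative assumptions, not part of the original post, and the sketch assumes a reasonably recent pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Seed chosen only so the run is repeatable (assumption, not from the question)
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# retbins=True makes qcut also return the bin edges it chose
binned, edges = pd.qcut(factors, 5, retbins=True)

# The 0%, 20%, 40%, 60%, 80% and 100% sample quantiles of the same data
quintiles = np.quantile(factors, [0.0, 0.2, 0.4, 0.6, 0.8, 1.0])

# The qcut edges should match the sample quintiles
print(np.allclose(edges, quintiles))

# Because the edges are quantiles, every bin holds the same number of records
print(binned.value_counts())
```

With 30 distinct values and 5 bins, each bin should end up with 6 records, whatever the breakpoints turn out to be.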

Conversely, for cut you will see something more uneven:

    pd.cut(factors, 5).value_counts()

    (-2.583, -1.539]    5
    (-1.539, -0.5]      5
    (-0.5, 0.539]       9
    (0.539, 1.578]      9
    (1.578, 2.617]      2

That's because cut chooses bins that are evenly spaced over the range of the values themselves, without regard to how often those values occur. Hence, because you drew from a normal distribution, you'll see higher frequencies in the inner bins and lower frequencies in the outer ones. This is essentially a tabular form of a histogram (which you would expect to be fairly bell shaped with 30 records).
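
The equal-width behaviour and the histogram analogy can be checked with a short sketch. As before, the seed and variable names are illustrative assumptions. Note that pandas slightly extends the lowest edge so that the minimum value falls inside the first bin, so the first width can differ from the others by a tiny amount.

```python
import numpy as np
import pandas as pd

# Seed chosen only so the run is repeatable (assumption, not from the question)
rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# retbins=True makes cut also return the bin edges it chose
binned, edges = pd.cut(factors, 5, retbins=True)

# Bin widths: equal, apart from the small extension of the lowest edge
widths = np.diff(edges)
print(widths)

# Counts per bin in bin order; these are generally unequal
cut_counts = binned.value_counts().to_numpy()
print(cut_counts)

# A 5-bin histogram of the same data gives the same tabulation,
# as long as no value lands exactly on an interior edge
hist_counts, hist_edges = np.histogram(factors, bins=5)
print(hist_counts)
```

In short, cut fixes the bin widths and lets the counts vary, while qcut fixes the counts and lets the widths vary.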

139

answered Sep 20 '22 19:09

JohnE
