
Why does pandas.cut() behave differently in unique count in two similar cases?

In the first case, I use a very simple DataFrame to try using pandas.cut() to count the number of unique values in one column within a range of another column. The code runs as expected:

[image: Case 1 code and output]

However, in the following code, pandas.cut() counts the number of unique values incorrectly. I expect the first bin (1462320000, 1462406400] to have 5 unique values, and every other bin, including the last bin (1462752000, 1462838400], to have 0 unique values.

Instead, as shown in the result, the code returns 5 unique values in the last bin (1462752000, 1462838400], while the 2 highlighted values should not be counted because they are out of range.

[image: Case 2 code and output, with the two out-of-range values highlighted]

So could anyone explain why pandas.cut() behaves so differently in these 2 cases? I would also be really thankful if you could tell me how to correct the code so that it correctly counts the number of unique values in one column within a range of values of another column.


ADDITIONAL INFO: (please import pandas and numpy to run the code; my pandas version is 0.19.2, and I am using Python 2.7)

For reference, here are my DataFrame and the code needed to reproduce the problem:

Case 1:

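import numpy as np
import pandas as pd

df = pd.DataFrame({'No': [1, 1.5, 2, 1, 3, 5, 10], 'useragent': ['a', 'c', 'b', 'c', 'b', 'a', 'z']})
print(type(df))
print(df)
# Count distinct user agents within each (0, 1], (1, 2], (2, 3] range of 'No'
df.groupby(pd.cut(df['No'], bins=np.arange(0, 4, 1))).useragent.nunique()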

Case 2:

print(type(df))
print(len(df))
print(df.time.nunique())
print(df.hash.nunique())
print(df[['time', 'hash']])
# Count distinct hashes within each one-day (86400 s) range of 'time'
df.groupby(pd.cut(df['time'], bins=np.arange(1462320000, 1462924800, 86400))).hash.nunique()

Case 2's Data:

time      hash
1462328401 qo
1462328401 qQ
1462838401 q1
1462328401 q1
1462328401 qU
1462328401 qU
1462328401 qU
1462328401 qU
1462328401 qX
1462838401 qX
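
The Case 2 data above can be rebuilt as a DataFrame roughly as follows. This is only a sketch: the names `bins`, `binned` and `out_of_range` are illustrative and not part of the original script. The last lines check which rows fall outside every bin, which should be the two rows with time 1462838401, since `np.arange` excludes its stop value.

```python
import numpy as np
import pandas as pd

# Rebuild the ten rows listed under "Case 2's Data"
df = pd.DataFrame({
    'time': [1462328401, 1462328401, 1462838401, 1462328401, 1462328401,
             1462328401, 1462328401, 1462328401, 1462328401, 1462838401],
    'hash': ['qo', 'qQ', 'q1', 'q1', 'qU', 'qU', 'qU', 'qU', 'qX', 'qX'],
})

# np.arange excludes the stop value, so the last bin edge is 1462838400
bins = np.arange(1462320000, 1462924800, 86400)
binned = pd.cut(df['time'], bins=bins)

# Rows whose time lies outside every bin get NaN from pd.cut
out_of_range = df[binned.isnull()]
print(bins[-1])
print(out_of_range)
```

With this setup, only the eight rows with time 1462328401 belong to a bin, and they carry the 5 distinct hashes expected in the first bin.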
asked Feb 20 '17 14:02 by weefwefwqg3


1 Answer

It seems to be a bug.

Here is a simple example:

In [50]: df = pd.DataFrame({'atime': [28]*8 + [38]*2, 'hash': np.random.randint(0, 3, 10)}).sort_values('hash')
Out[50]: 
      atime  hash
1     28     0
3     28     0
4     28     0
5     28     0
8     38     0
2     28     1
6     28     1
0     28     2
7     28     2
9     38     2 

In [50bis]: df.groupby(pd.cut(df.atime, bins=np.arange(27, 40, 2))).hash.unique()
Out[50bis]: 
atime
(27, 29]                   [0, 1, 2]   # ok
(29, 31]                          []
(31, 33]                          []
(33, 35]                          []
(35, 37]                          []
(37, 39]                      [0, 2]
Name: hash, dtype: object

In [51]: df.groupby(pd.cut(df.atime, bins=np.arange(27, 40, 2))).hash.nunique()
Out[51]: 
atime
(27, 29]    2   # bug: unique() above lists 3 values, so this should be 3
(29, 31]    0
(31, 33]    0
(33, 35]    0
(35, 37]    0
(37, 39]    2
Name: hash, dtype: int64
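
As a cross-check that avoids grouping on the categorical returned by `pd.cut`, the expected counts can be computed with plain boolean masks. This is a sketch under assumptions: the `hash` column is fixed to the values shown in the table above (it was random there), and `edges`, `in_bin` and `expected` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same values as the table above (hash was random there; fixed here)
df = pd.DataFrame({'atime': [28] * 8 + [38] * 2,
                   'hash': [2, 0, 1, 0, 0, 0, 1, 2, 0, 2]})

edges = np.arange(27, 40, 2)

# For each right-closed bin (left, right], count the distinct hash values
expected = {}
for left, right in zip(edges[:-1], edges[1:]):
    in_bin = (df['atime'] > left) & (df['atime'] <= right)
    expected[(int(left), int(right))] = int(df.loc[in_bin, 'hash'].nunique())

print(expected)
```

For this data the mask-based count gives 3 for (27, 29] and 2 for (37, 39], which agrees with the `unique()` output rather than with the `nunique()` output.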

An effective workaround seems to be converting the cut result to a list:

In [52]: df.groupby(pd.cut(df.atime, bins=np.arange(27, 40, 2)).tolist()).hash.nunique()
Out[52]: 
atime
(27, 29]    3
(37, 39]    2
Name: hash, dtype: int64
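
A variant of the same idea is to group on plain string labels instead of the categorical. This is a sketch, not something checked against pandas 0.19.2: using `astype(str)` is an assumption swapped in for `tolist()`, the `hash` values are fixed to those in the table above, and `labels` and `counts` are illustrative names. Rows that fall outside every bin may need to be filtered out first, since they have no interval label.

```python
import numpy as np
import pandas as pd

# Same values as the example table (hash was random there; fixed here)
df = pd.DataFrame({'atime': [28] * 8 + [38] * 2,
                   'hash': [2, 0, 1, 0, 0, 0, 1, 2, 0, 2]})

# Turn the categorical bins into plain string labels before grouping
labels = pd.cut(df['atime'], bins=np.arange(27, 40, 2)).astype(str)

# Only labels that actually occur form groups, so empty bins are omitted
counts = df.groupby(labels)['hash'].nunique()
print(counts)
```

As with the list-based workaround, only the two non-empty bins appear in the result.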
answered Oct 21 '22 23:10 by B. M.