In the first case, I use a very simple DataFrame
to try using pandas.cut()
to count the number of unique values in one column within a range of another column. The code runs as expected:
However, in the following code, pandas.cut()
counts the number of unique values wrong. I expect the first bin (1462320000, 1462406400] to have 5 unique values, and other bins including the last bin (1462752000, 1462838400] to have 0 unique values.
Instead, as shown in the result, the code returns 5 unique values in the last bin (1462752000, 1462838400], while the 2 highlighted values should not be counted because they are out of range.
So could anyone explain why pandas.cut()
behaves so different in these 2 cases? And also, I would be really thankful if you can also tell me how can I correct the code to correctly count the number of unique values in one column within a range of value of another column.
ADDITIONNAL INFO: (please import pandas
and numpy
to run the code, my pandas version is 0.19.2, and I am using python 2.7)
For your ready reference, I hereby post my DataFrame
and the codes for you to reproduce my code:
Case 1:
df = pd.DataFrame({'No': [1,1.5,2,1,3,5,10], 'useragent': ['a', 'c', 'b', 'c', 'b','a','z']})
print type(df)
print df
df.groupby(pd.cut(df['No'], bins=np.arange(0,4,1))).useragent.nunique()
Case 2:
print type(df)
print len(df)
print df.time.nunique()
print df.hash.nunique()
print df[['time','hash']]
df.groupby(pd.cut(df['time'], bins =np.arange(1462320000,1462924800,86400))).hash.nunique()
Case 2's Data:
time hash
1462328401 qo
1462328401 qQ
1462838401 q1
1462328401 q1
1462328401 qU
1462328401 qU
1462328401 qU
1462328401 qU
1462328401 qX
1462838401 qX
In order to get the count of unique values on multiple columns use pandas DataFrame. drop_duplicates() which drop duplicate rows from pandas DataFrame. This eliminates duplicates and return DataFrame with unique rows.
The major distinction is that qcut will calculate the size of each bin in order to make sure the distribution of data in the bins is equal. In other words, all bins will have (roughly) the same number of observations but the bin range will vary. On the other hand, cut is used to specifically define the bin edges.
The unique() function is used to get unique values of Series object. Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort. The unique values returned as a NumPy array.
Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.
It's seems to be a bug.
On a simple example :
In [50]: df=pd.DataFrame({'atime': [28]*8+[38]*2, 'hash':randint(0,3,10)}
).sort_values('hash')
Out[50]:
atime hash
1 28 0
3 28 0
4 28 0
5 28 0
8 38 0
2 28 1
6 28 1
0 28 2
7 28 2
9 38 2
In [50bis;)]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2))).hash.unique()
Out[50bis]:
atime
(27, 29] [0, 1, 2] # ok
(29, 31] []
(31, 33] []
(33, 35] []
(35, 37] []
(37, 39] [0, 2]
Name: hash, dtype: object
In [51]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2))).hash.nunique()
Out[51]:
atime
(27, 29] 2 # bug
(29, 31] 0
(31, 33] 0
(33, 35] 0
(35, 37] 0
(37, 39] 2
Name: hash, dtype: int64
Here seems to be a efficient workaround, converting the cut result in a list :
In [52]: df.groupby(pd.cut(df.atime,bins=arange(27,40,2)).tolist()
).hash.nunique()
Out[52]:
atime
(27, 29] 3
(37, 39] 2
Name: hash, dtype: int64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With