Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

efficient use of R data.table and unique()

Tags:

r

data.table

Is there a more efficient query than the following

DT[, list(length(unique(OrderNo)) ),customerID]

to refine a LONG format table with customer id's, order number and product line items, meaning that there will be duplicate rows with the same order id if a customer has purchased more than 1 item in that transaction.

Trying to work out unique purchases. length() gives a count of all order id's by customer ID including duplicates, looking for just the unique number.

Edit from here:

Here is some dummy code. Ideally what i am looking for is the output from the first query using the unique().

df <- data.frame(
             customerID=as.factor(c(rep("A",3),rep("B",4))),
             product=as.factor(c(rep("widget",2),rep("otherstuff",5))),
             orderID=as.factor(c("xyz","xyz","abd","qwe","rty","yui","poi")),
             OrderDate=as.Date(c("2013-07-01","2013-07-01","2013-07-03","2013-06-01","2013-06-02","2013-06-03","2013-07-01"))
             )

DT.eg <- as.data.table(df)
#Gives unique order counts
DT.eg[, list(orderlength = length(unique(orderID)) ),customerID]
#Gives counts of all orders by customer
DT.eg[,.SD, keyby=list(orderID, customerID)][, .N, by=customerID]

         ^
         |
  This should be .N, not .SD  ~ R.S.
like image 785
digdeep Avatar asked Oct 24 '13 01:10

digdeep


People also ask

Is data table DT == true?

data. table(DT) is TRUE. To better description, I put parts of my original code here. So you may understand where goes wrong.

What are unique values in R?

The unique() function in R is used to eliminate or delete the duplicate values or the rows present in the vector, data frame, or matrix as well. The unique() function found its importance in the EDA (Exploratory Data Analysis) as it directly identifies and eliminates the duplicate values in the data.

What is data table in R?

data.table is an R package that provides an enhanced version of data.frame s, which are the standard data structure for storing data in base R. In the Data section above, we already created a data.table using fread() . We can also create one using the data.table() function.


2 Answers

if you are trying to count the number of unique purchases per customer, use

 DT[, .N, keyby=list(customerId, OrderNo)][, .N, by=customerId]
like image 50
Ricardo Saporta Avatar answered Oct 05 '22 12:10

Ricardo Saporta


As of version 1.9.6 (on CRAN 19 Sep 2015), data.table has gained the helper function uniqueN() which is equivalent to length(unique(x)) but much faster (according to data.table NEWS).

With this,

DT.eg[, list(orderlength = length(unique(orderID)) ),customerID]

and

DT.eg[,.N, keyby=list(orderID, customerID)][, .N, by=customerID]

can be rewritten as

DT.eg[, .(orderlength = uniqueN(orderID)), customerID]
   customerID orderlength
1:          A           2
2:          B           4
like image 31
Uwe Avatar answered Oct 05 '22 14:10

Uwe