Is there a more efficient query than the following?
DT[, list(length(unique(OrderNo))), customerID]
I am trying to summarise a LONG-format table containing customer IDs, order numbers and product line items. The same order ID is repeated on several rows whenever a customer purchased more than one item in that transaction.
I am trying to work out the number of unique purchases per customer. length() on its own counts every order ID per customer, duplicates included; I am looking for just the number of unique ones.
Here is some dummy code. Ideally, what I am looking for is the output of the first query, the one using unique().
library(data.table)
df <- data.frame(
  customerID = as.factor(c(rep("A", 3), rep("B", 4))),
  product = as.factor(c(rep("widget", 2), rep("otherstuff", 5))),
  orderID = as.factor(c("xyz", "xyz", "abd", "qwe", "rty", "yui", "poi")),
  OrderDate = as.Date(c("2013-07-01", "2013-07-01", "2013-07-03", "2013-06-01", "2013-06-02", "2013-06-03", "2013-07-01"))
)
DT.eg <- as.data.table(df)
#Gives unique order counts
DT.eg[, list(orderlength = length(unique(orderID)) ),customerID]
#Gives counts of all orders by customer
DT.eg[,.SD, keyby=list(orderID, customerID)][, .N, by=customerID]
(Note on the line above: the first j expression should be .N, not .SD ~ R.S.)
The unique() function in R removes duplicate elements from a vector, or duplicate rows from a data frame or matrix. It is useful in exploratory data analysis (EDA) because it directly identifies and drops duplicate values in the data.
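As a minimal base R sketch of that behaviour (the vector `orders` and the data frame `purchases` are made-up names for illustration, not part of the original question):

```r
# Hypothetical order IDs for one customer; "xyz" appears on two line items
orders <- c("xyz", "xyz", "abd")

unique(orders)          # duplicates dropped, first occurrences kept
length(orders)          # counts every element, duplicates included
length(unique(orders))  # counts distinct values only

# On a data frame, unique() drops rows that are duplicated in every column
purchases <- data.frame(customerID = c("A", "A", "A"), orderID = orders)
unique(purchases)
```

This is the difference the question is about: length() counts line items, while length(unique()) counts distinct orders.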
data.table is an R package that provides an enhanced version of data.frame, the standard data structure for storing tabular data in base R. A data.table can be created by reading a file with fread(), by calling the data.table() function, or by converting an existing data frame with as.data.table().
If you are trying to count the number of unique purchases per customer, use
DT[, .N, keyby=list(customerID, OrderNo)][, .N, by=customerID]
As of version 1.9.6 (on CRAN 19 Sep 2015), data.table has gained the helper function uniqueN(), which is equivalent to length(unique(x)) but much faster (according to the data.table NEWS).
With this,
DT.eg[, list(orderlength = length(unique(orderID)) ),customerID]
and
DT.eg[,.N, keyby=list(orderID, customerID)][, .N, by=customerID]
can be rewritten as
DT.eg[, .(orderlength = uniqueN(orderID)), customerID]
   customerID orderlength
1:          A           2
2:          B           4
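To check that the three forms agree, here is a small self-contained sketch on the same toy data. It assumes data.table 1.9.6 or later is installed; the result names `by_unique`, `by_chain` and `by_uniqueN` are only for illustration.

```r
library(data.table)

# Same customers and order IDs as the dummy data in the question
DT.eg <- data.table(
  customerID = c(rep("A", 3), rep("B", 4)),
  orderID    = c("xyz", "xyz", "abd", "qwe", "rty", "yui", "poi")
)

# 1. Base R idiom inside j
by_unique <- DT.eg[, .(orderlength = length(unique(orderID))), by = customerID]

# 2. Chained grouping: one row per (orderID, customerID) pair, then count rows
by_chain <- DT.eg[, .N, keyby = .(orderID, customerID)][, .(orderlength = .N), by = customerID]

# 3. uniqueN() helper
by_uniqueN <- DT.eg[, .(orderlength = uniqueN(orderID)), by = customerID]

print(by_uniqueN)
```

Which form is fastest on real data depends on the number of groups and rows, so it is worth timing them on your own table before settling on one.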