Sort a data.table fast by Ascending/Descending order

Q: How do you sort data table in R?

To sort a data frame in R, use the order() function. By default, sorting is ascending. Prefix a numeric sorting variable with a minus sign to sort it in descending order.
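As a minimal base R sketch of that rule, the example below sorts a small data frame by two ascending columns and one descending column. The data frame `df` and its values are made up for illustration; only the column names come from the question.

```r
# Hypothetical toy data; only the column names come from the question.
df <- data.frame(
  Year     = c(2001, 2000, 2000, 2001),
  MemberID = c("b", "a", "a", "b"),
  Month    = c(3, 1, 7, 9),
  stringsAsFactors = FALSE
)

# Ascending Year and MemberID, descending Month.
# The unary minus works because Month is numeric.
sorted <- df[order(df$Year, df$MemberID, -df$Month), ]
print(sorted)
```

The minus sign applies to numeric columns only. For a character column in base R, wrapping it as `-xtfrm(column)` is one way to get a descending sort.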

Tags: performance, r, data.table

I have a data.table with about 3 million rows and 40 columns. I would like to sort this table by descending order within groups like the following sql mock code:

sort by ascending Year, ascending MemberID, descending Month

Is there an equivalent way in data.table to do this? So far I have to break it down into 2 steps:

setkey(X, Year, MemberID)

This is very fast and takes only a few seconds.

X <- X[,.SD[order(-Month)],by=list(Year, MemberID)]

This step takes much longer (5 minutes).

Update: Someone suggested X <- X[order(Year, MemberID, -Month)] in a comment and later deleted it. This approach seems to be much faster:

   user  system elapsed
  5.560  11.242  66.236

My approach: setkey() then order(-Month)

   user  system elapsed
816.144   9.648 848.798

My question now is: if I summarize by Year, MemberID and Month after X[order(Year, MemberID, Month)], does data.table recognize the sort order?

Update 2: in response to Matthew Dowle:

After setkey with Year, MemberID and Month, I still have multiple records per group. What I would like is to summarize each of the groups. What I meant was: if I use X[order(Year, MemberID, Month)], does the summation make use of data.table's binary search functionality?

monthly.X <- X[, lapply(.SD, sum), by = list(Year, MemberID, Month)]

Update 3: Matthew D proposed several approaches. The first approach runs faster than the order() approach:

   user  system elapsed
  7.910   7.750  53.916

Matthew: what surprised me was that converting the sign of Month takes most of the time. Without it, setkey is blazing fast.

asked Dec 03 '12 14:12

AdamNYC

2 Answers

Update June 5 2014:

The current development version of data.table, v1.9.3, implements two new functions, setorder and setorderv, which do exactly what you require. These functions reorder the data.table by reference, with the option to choose ascending or descending order for each column to order by. Check out ?setorder for more info.
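A minimal sketch of both functions on a toy table follows. The table `X` and its values are made up for illustration; the column names follow the question.

```r
library(data.table)

# Hypothetical toy table; column names follow the question.
X <- data.table(
  Year     = c(2001L, 2000L, 2000L, 2001L),
  MemberID = c("b", "a", "a", "b"),
  Month    = c(3L, 1L, 7L, 9L)
)

# Reorder in place: ascending Year and MemberID, descending Month.
setorder(X, Year, MemberID, -Month)
print(X)

# The same ordering with column names passed as strings;
# in `order`, 1 means ascending and -1 means descending.
setorderv(X, c("Year", "MemberID", "Month"), order = c(1L, 1L, -1L))
print(X)
```

No assignment is needed in either call, because both functions modify `X` itself.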

In addition, DT[order(.)] is now optimised by default to use data.table's internal fast order instead of base:::order. Unlike setorder, this makes a full copy of the data and is therefore less memory efficient, but it is still orders of magnitude faster than base's order.
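The difference between the copying and by-reference forms can be sketched on a tiny table. The table `DT` and the name `sorted_copy` below are illustrative only.

```r
library(data.table)

# Hypothetical toy table.
DT <- data.table(id = c(2L, 3L, 1L), value = c("b", "c", "a"))

# DT[order(...)] returns a new, sorted table; DT keeps its row order.
sorted_copy <- DT[order(-id)]
stopifnot(identical(DT$id, c(2L, 3L, 1L)))
stopifnot(identical(sorted_copy$id, c(3L, 2L, 1L)))

# setorder() rearranges the rows of DT itself, without a second table.
setorder(DT, -id)
stopifnot(identical(DT$id, c(3L, 2L, 1L)))
```

On a table with millions of rows, the first form briefly holds two copies of the data in memory, which is the trade-off described above.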

Benchmarks:

Here's an illustration on the speed differences using setorder, data.table's internal fast order and with base:::order:

require(data.table) ## 1.9.3
set.seed(1L)
DT <- data.table(Year     = sample(1950:2000, 3e6, TRUE),
                 memberID = sample(paste0("V", 1:1e4), 3e6, TRUE),
                 month    = sample(12, 3e6, TRUE))

## using base:::order
system.time(ans1 <- DT[base:::order(Year, memberID, -month)])
#   user  system elapsed
# 76.909   0.262  81.266

## optimised to use data.table's fast order
system.time(ans2 <- DT[order(Year, memberID, -month)])
#   user  system elapsed
#  0.985   0.030   1.027

## reorders by reference
system.time(setorder(DT, Year, memberID, -month))
#   user  system elapsed
#  0.585   0.013   0.600

## or alternatively
## setorderv(DT, c("Year", "memberID", "month"), c(1,1,-1))

## are they equal?
identical(ans2, DT)    # [1] TRUE
identical(ans1, ans2)  # [1] TRUE

On this data, the benchmarks indicate that data.table's order is ~79x faster than base:::order and that setorder is ~135x faster than base:::order.

data.table always sorts/orders in C-locale. Only if you need to order in another locale do you have to fall back to DT[base:::order(.)].
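A small sketch of what C-locale ordering means for character columns is shown below. The table `DT` and the name `c_sorted` are made up for illustration.

```r
library(data.table)

# Hypothetical toy table with mixed-case strings.
DT <- data.table(x = c("b", "A", "a", "B"))

# data.table's optimised order compares strings in C-locale, so
# uppercase ASCII letters sort before lowercase ones.
c_sorted <- DT[order(x)]$x
print(c_sorted)

# base::order() follows the session's collation locale, so its result
# depends on the machine and may differ from c_sorted.
print(DT[base::order(x)]$x)
```

The second result is deliberately left without an expected value, since it varies with the locale settings of the R session.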

All these new optimisations and functions together constitute FR #2405. bit64::integer64 support also has been added.

NOTE: Please refer to the history/revisions for the earlier answer and updates.

answered Sep 21 '22 09:09

Matt Dowle

The comment was mine, so I'll post the answer. I removed it because I couldn't test whether it was equivalent to what you already had. Glad to hear it's faster.

X <- X[order(Year, MemberID, -Month)]

Summarizing shouldn't depend on the order of your rows.
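A minimal sketch of that point: the grouped sums below come out the same whether or not the table is sorted first. The table `X`, the `Amount` column, and the result names are hypothetical; only the grouping columns come from the question.

```r
library(data.table)

# Hypothetical toy table; Amount is an invented value column.
X <- data.table(
  Year     = c(2001L, 2000L, 2000L, 2001L, 2000L),
  MemberID = c("b", "a", "a", "b", "a"),
  Month    = c(3L, 1L, 1L, 3L, 7L),
  Amount   = c(10, 20, 30, 40, 50)
)

# Group sums on the table as it is.
sums_unsorted <- X[, lapply(.SD, sum), by = list(Year, MemberID, Month)]

# Group sums after sorting the rows first.
X_sorted <- X[order(Year, MemberID, -Month)]
sums_sorted <- X_sorted[, lapply(.SD, sum), by = list(Year, MemberID, Month)]

# Only the order of the result rows can differ, so align it before comparing.
setorder(sums_unsorted, Year, MemberID, Month)
setorder(sums_sorted, Year, MemberID, Month)
print(all.equal(sums_unsorted, sums_sorted))
```

With `by`, groups appear in the result in order of first appearance, which is why the two results are put in the same row order before the comparison.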

answered Sep 18 '22 09:09

Matthew Plourde
