Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sort a data table with specific value order

Tags:

r

data.table

Here is a short data.table:

DT <- data.table(Tag1 = c(22,253,6219,6219,252862,252864,312786,312812),
                 Tag2 = c(22,255,6220,252857,252863,252865,251191,252863),
                 Date= as.Date(as.character(c("7/25/2008","6/15/2000","6/30/2000","9/6/2002","9/6/2002","9/6/2002","9/3/2003","9/5/2003")),format = "%m/%d/%Y"))
DT

     Tag1   Tag2       Date
1:     22     22 2008-07-25
2:    253    255 2000-06-15
3:   6219   6220 2000-06-30
4:   6219 252857 2002-09-06
5: 252862 252863 2002-09-06
6: 252864 252865 2002-09-06
7: 312786 251191 2003-09-03
8: 312812 252863 2003-09-05

I would like to sort the data.table in ascending order by 3 columns: Tag1, Tag2 and Date. I tested:

> test <- DT[order(Tag1, Tag2, Date)]
> test
     Tag1   Tag2       Date
1:     22     22 2008-07-25
2:    253    255 2000-06-15
3:   6219   6220 2000-06-30
4:   6219 252857 2002-09-06
5: 252862 252863 2002-09-06
6: 252864 252865 2002-09-06
7: 312786 251191 2003-09-03
8: 312812 252863 2003-09-05

However, I would like to sort the data.table as follows:

> test
         Tag1   Tag2       Date
    1:     22     22 2008-07-25
    2:    253    255 2000-06-15
    3:   6219   6220 2000-06-30
    4:   6219 252857 2002-09-06
    5: 252862 252863 2002-09-06
    6: 312812 252863 2003-09-05
    7: 252864 252865 2002-09-06
    8: 312786 251191 2003-09-03

In particular, the duplicated values for Tag1 or Tag1 should be disposed one below the other (for example: 6219 for Tag1 and 252863 for Tag2). How can I do this ?

EDIT:

The proposed solutions work for a short data.table (like the data.table above). Here is a longer version:

DT <- data.table(Tag1 = c(252860, 252862, 312812, 252864, 252866, 252868, 252870, 318880, 252872, 252874, 252876, 252878, 252880, 252880, 252881, 252883,
252885, 252887, 311264, 252889, 252889, 252892, 318879, 318880, 318881), Tag2 = c(252861, 252863, 252863, 252865, 252867, 252869, 252871, 252871, 252873,
252875, 252877, 252879, 414611, 905593, 252882, 252884, 252886, 252888, 252888, 252890, 318904, 252893, 318878, 414547, 318882), Date = c("9/6/2002",
"9/6/2002", "9/5/2003", "9/6/2002", "9/6/2002", "9/6/2002", "9/6/2002", "10/8/2003", "9/6/2002", "9/6/2002", "9/6/2002", "9/6/2002", "10/5/2004",
"9/6/2002", "9/6/2002", "9/6/2002", "9/10/2002", "9/10/2002", "7/15/2003", "9/10/2002", "10/15/2003", "9/10/2002", "10/8/2003", "9/29/2004","10/8/2003"))

Here is the expected result (i.e., data.table "After"). In particular, the data.table "After" should respect two conditions:

1) the rows are sorted by date in ascending order

2) the duplicated values for Tag1 or Tag1 are disposed one below the other (eventually not necessary in ascending order)

All duplicated values for Tag1 and Tag2 are in yellow.

enter image description here

like image 712
Nell Avatar asked Aug 10 '16 17:08

Nell


3 Answers

Old Order

df[order(Tag1, Tag2, Date)]
#      Tag1   Tag2       Date
# 1:     22     22 2008-07-25
# 2:    253    255 2000-06-15
# 3:   6219   6220 2000-06-30
# 4:   6219 252857 2002-09-06
# 5: 252862 252863 2002-09-06
# 6: 252864 252865 2002-09-06
# 7: 312786 251191 2003-09-03
# 8: 312812 252863 2003-09-05

New Order
Sort Date column in descending order, then sort Tag1 in ascending order grouped by Tag2.

setcolorder(dt1 <- df[order(-Date)][order(Tag1), .SD, by = Tag2], colnames(df))

dt1
#      Tag1   Tag2       Date
# 1:     22     22 2008-07-25
# 2:    253    255 2000-06-15
# 3:   6219 252857 2002-09-06
# 4:   6219   6220 2000-06-30
# 5: 252862 252863 2002-09-06
# 6: 312812 252863 2003-09-05
# 7: 252864 252865 2002-09-06
# 8: 312786 251191 2003-09-03

Solution of @akrun in the comments disturbs the structure of the data. Here is the comparison. Look at #4: 6219 should have 252857 instead of 251191

df[,lapply(df, sort)]
#      Tag1   Tag2       Date
# 1:     22     22 2000-06-15
# 2:    253    255 2000-06-30
# 3:   6219   6220 2002-09-06
# 4:   6219 251191 2002-09-06
# 5: 252862 252857 2002-09-06
# 6: 252864 252863 2003-09-03
# 7: 312786 252863 2003-09-05
# 8: 312812 252865 2008-07-25
like image 59
Sathish Avatar answered Nov 12 '22 10:11

Sathish


The data in the example are already sorted so here is a shortened version in unsorted order to illustrate changing the sort order.

> library(data.table)
> DT <- data.table(Tag1 = c(22,253,22,22),
                   Tag2 = c(1,255,2,2),
                   Date = as.Date(as.character(c(
                 "1/1/2010","4/4/2000","3/3/2003","2/2/2000")), format = "%m/%d/%Y"))
> DT                 

   Tag1 Tag2       Date
1:   22    1 2010-01-01
2:  253  255 2000-04-04
3:   22    2 2003-03-03
4:   22    2 2000-02-02

Use order to create a new data.table in a particular sorted order. Order function takes a single field or multiple fields to sort on in the specified order.

# sorts first by Tag1 then by Tag2 and finally by Date (in ascending order)
> DT2 <- DT[order(Tag1, Tag2, Date)]
> DT2

   Tag1 Tag2       Date
1:   22    1 2010-01-01
2:   22    2 2000-02-02
3:   22    2 2003-03-03
4:  253  255 2000-04-04

To re-order the data.table in place use setorder function.

setorder(DT, Tag1, Tag2, Date)

If want to sort in descending order then use - prefix on the field names.

 > DT[order(Tag1, Tag2, -Date)]
 > setorder(DT, Tag1, Tag2, -Date)
like image 2
CodeMonkey Avatar answered Nov 12 '22 10:11

CodeMonkey


In an update to the question, OP indicated that the previous answers (including the accepted answer) do not work for a longer data set. Using the new data after the question edit, we can sort like this:

The trick is to start with the table ordered by date then tags, in ascending order (the initial data in OP are already in this state, but for generality I start my ensuring the data are ordered by setkey(DT, asDate, Tag1, Tag2)). Then number the distinct values of tag2, in the order they appear and sort by these group numbers (the next two lines). This will make sure that identical values of tag2 follow each other, without otherwise disturbing the order of the table. Next, do the same thing for tag1.

DT[, asDate := as.Date(Date, format = "%m/%d/%Y")]
setkey(DT, asDate, Tag1, Tag2)

DT[, g2 := .GRP, Tag2]
setkey(DT, g2)
DT[, g1 := .GRP, Tag1]
setkey(DT, g1)

DT[, c("g1", "g2", "asDate") := NULL][]

      Tag1   Tag2       Date
 1: 252860 252861   9/6/2002
 2: 252862 252863   9/6/2002
 3: 312812 252863   9/5/2003
 4: 252864 252865   9/6/2002
 5: 252866 252867   9/6/2002
 6: 252868 252869   9/6/2002
 7: 252870 252871   9/6/2002
 8: 318880 252871  10/8/2003
 9: 318880 414547  9/29/2004
10: 252872 252873   9/6/2002
11: 252874 252875   9/6/2002
12: 252876 252877   9/6/2002
13: 252878 252879   9/6/2002
14: 252880 905593   9/6/2002
15: 252880 414611  10/5/2004
16: 252881 252882   9/6/2002
17: 252883 252884   9/6/2002
18: 252885 252886  9/10/2002
19: 252887 252888  9/10/2002
20: 311264 252888  7/15/2003
21: 252889 252890  9/10/2002
22: 252889 318904 10/15/2003
23: 252892 252893  9/10/2002
24: 318879 318878  10/8/2003
25: 318881 318882  10/8/2003
      Tag1   Tag2       Date

Notes

This works perfectly on the sample data. However, a word of caution. There is no guarantee that a solution even exists for all possible data (either by this method or by any method), particularly if repeated tags are a large fraction of all the tags. Consider, for example, the following columns in which it is not possible to order the rows such that the same letters always appear consecutively in both columns simultaneously:

 a b
 a c
 b a
 b c
 c a
 c b
like image 2
dww Avatar answered Nov 12 '22 09:11

dww