Sort not sorting numbers correctly in R

Question

I have data that looks like this:

    score        temp
1 a.score  0.05502011
2 b.score  0.02484594
3 c.score -0.07183767
4 d.score -0.06932274
5 e.score -0.15512460

I want to sort the names based on the values, from most negative to most positive, and take the top 4. I try:

> topfour.values <- apply(temp.df, 2, function(xx)head(sort(xx), 4, na.rm = TRUE, decreasing = FALSE))
> topfour.names  <- apply(temp.df, 2, function(xx)head(names(sort(xx)), 4, na.rm = TRUE))
> topfour        <- rbind(topfour.names, topfour.values)

and I get

> topfour.values
                        temp[, 1]                           
    d.score              "-0.06932274"            
    c.score              "-0.0718376680"          
    e.score              "-0.1551246"             
    b.score              " 0.02484594"

What order is this? What did I do wrong and how do I get it sorted properly?

I've tried method = "quick" and method = "shell" as options, but the order still doesn't make sense.

Paulo E. Cardoso · Accepted Answer

I believe your data are stored as the wrong type. It would help to know how you are reading the data into R. In the example above you are working with a character vector, not a numeric one.

head(with(df, df[order(temp), ]), 4)
    score        temp
5 e.score -0.15512460
3 c.score -0.07183767
4 d.score -0.06932274
2 b.score  0.02484594
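
To see why the original output looked scrambled, it can help to compare how R orders the same values as text and as numbers. This is a minimal sketch; `temp_chr` is a hypothetical stand-in for the column as it may have been read in, not the asker's actual object.

```r
# Hypothetical stand-in for the column stored as text instead of numbers
temp_chr <- c("0.05502011", "0.02484594", "-0.07183767",
              "-0.06932274", "-0.15512460")

# class() (or str()) reveals the storage type
class(temp_chr)

# Sorting text compares strings character by character, not by numeric value,
# and the exact order depends on the locale's collation rules
sort(temp_chr)

# Converting to numeric first gives the expected numeric ordering
temp_num <- as.numeric(temp_chr)
sort(temp_num)
```

If the column turns out to be a factor rather than a character vector, convert with `as.numeric(as.character(x))`, because `as.numeric` applied directly to a factor returns its internal integer codes.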

Following the approach proposed by Greg Snow, and given that you are only interested in the vector of top values and that the partial argument cannot be used in this case, a simple speed test comparing order and sort.list shows that the difference may be negligible, even for a vector of 1e7 elements.

df1 <- data.frame(temp = rnorm(1e+7),
                  score = sample(letters, 1e+7, replace = TRUE))

library(microbenchmark)
microbenchmark(
  head(with(df1, df1[order(temp), 1]), 4),
  head(with(df1, df1[sort.list(temp), 1]), 4),
  head(df1[order(df1$temp), 1], 4),
  head(df1[sort.list(df1$temp), 1], 4),
  times = 1L
  )

Unit: seconds
                                        expr      min       lq   median       uq      max neval
     head(with(df1, df1[order(temp), 1]), 4) 13.42581 13.42581 13.42581 13.42581 13.42581     1
 head(with(df1, df1[sort.list(temp), 1]), 4) 13.80256 13.80256 13.80256 13.80256 13.80256     1
            head(df1[order(df1$temp), 1], 4) 13.88580 13.88580 13.88580 13.88580 13.88580     1
        head(df1[sort.list(df1$temp), 1], 4) 13.13579 13.13579 13.13579 13.13579 13.13579     1

Greg Snow · Answer

There are several problems, some of which have been discussed in the comments, but one big one that I have not seen mentioned yet is that the apply function works on matrices and therefore converts your data frame to a matrix before doing anything else. Since your data has both a factor and a numeric variable, the numbers are converted to character strings, and the sorting is done on the character string representation, not the numerical value. Using tools that work directly with data frames (and lists) avoids this, as does using order and dropping apply altogether.
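
The coercion described above can be checked directly. The sketch below rebuilds the question's data as a hypothetical `temp.df` (the real object was not shown in full, and `stringsAsFactors = TRUE` is an assumption made to reproduce the factor column), looks at what `apply` would actually receive, and then sorts with `order` on the data frame itself.

```r
# Hypothetical reconstruction of the question's data frame
temp.df <- data.frame(
  score = c("a.score", "b.score", "c.score", "d.score", "e.score"),
  temp  = c(0.05502011, 0.02484594, -0.07183767, -0.06932274, -0.15512460),
  stringsAsFactors = TRUE
)

# apply() passes a data frame through as.matrix(); a matrix holds one type,
# so mixing a factor with numbers yields a character matrix
m <- as.matrix(temp.df)
typeof(m)

# order() on the numeric column keeps the values numeric and the rows intact
topfour <- head(temp.df[order(temp.df$temp), ], 4)
topfour
```

Because the rows are reordered as a whole, the names and values stay paired, so there is no need to build them separately and `rbind` them afterwards.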

Also, if you only want the n largest or smallest values then you may be able to speed things up a little by using sort.list instead of order and specifying the partial argument.
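
As a hedged illustration of partial sorting: in base R the `partial` argument is accepted by `sort`, while the documentation for `order` and `sort.list` describes non-NULL `partial` values as not implemented. A partial sort can therefore return the n smallest values, but not the row indices needed to pull the matching names. A minimal sketch with a made-up vector `x`:

```r
set.seed(1)        # made-up data, for illustration only
x <- rnorm(1000)
n <- 4

# partial = 1:n guarantees that positions 1..n hold the values
# a full sort would place there; the rest is left unordered
smallest <- sort(x, partial = 1:n)[1:n]

# Same values as a full sort followed by head()
all.equal(smallest, head(sort(x), n))
```

Whether this is noticeably faster than a full sort depends on the vector size, so it is worth benchmarking on your own data before relying on it.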

Tags: sorting, r

Hack-R
