
Sort not sorting numbers correctly in R

Tags:

sorting

r

I have data that looks like this:

    score        temp
1 a.score  0.05502011
2 b.score  0.02484594
3 c.score -0.07183767
4 d.score -0.06932274
5 e.score -0.15512460

I want to sort the names by their values, from most negative to most positive, and take the top 4. I tried:

> topfour.values <- apply(temp.df, 2, function(xx)head(sort(xx), 4, na.rm = TRUE, decreasing = FALSE))
> topfour.names  <- apply(temp.df, 2, function(xx)head(names(sort(xx)), 4, na.rm = TRUE))
> topfour        <- rbind(topfour.names, topfour.values)

and I get

> topfour.values
                        temp[, 1]                           
    d.score              "-0.06932274"            
    c.score              "-0.0718376680"          
    e.score              "-0.1551246"             
    b.score              " 0.02484594"   

What order is this? What did I do wrong and how do I get it sorted properly?

I've tried method == "Quick" and method == "Shell" as options, but the order still doesn't make sense.

Hack-R asked Jan 27 '26 07:01


2 Answers

I suspect your data is being stored as the wrong type, so it would help to know how you are reading it into R. In the example above you are sorting a character vector, not a numeric one. With temp stored as numeric, ordering the data frame directly gives the expected result:

head(with(df, df[order(temp), ]), 4)
    score        temp
5 e.score -0.15512460
3 c.score -0.07183767
4 d.score -0.06932274
2 b.score  0.02484594
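
To check whether the column really is character and to convert it, something along these lines may help. This is only a sketch: the data frame below is a hypothetical reconstruction of the question's data with temp deliberately stored as text, and the name top4 is made up for illustration.

```r
# Hypothetical reconstruction of the question's data, with temp read in as text.
df <- data.frame(
  score = c("a.score", "b.score", "c.score", "d.score", "e.score"),
  temp  = c("0.05502011", "0.02484594", "-0.07183767",
            "-0.06932274", "-0.15512460"),
  stringsAsFactors = FALSE
)

str(df)  # temp is reported as chr rather than num

# Convert to numeric; going through as.character() also covers the case
# where the column was read in as a factor.
df$temp <- as.numeric(as.character(df$temp))

# Order the rows by the numeric column and keep the first four.
top4 <- head(df[order(df$temp), ], 4)
top4
```

If str() already reports temp as num, the conversion is unnecessary and the problem lies elsewhere, for example in how the sorting function is applied.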

Following Greg Snow's suggestion, and given that you only need the vector of top values: the partial argument cannot be used in this case, and a simple speed test comparing order and sort.list shows that the difference between them is negligible, even for a vector of length 1e7.

# Simulated data: 1e7 random values plus a random letter label for each row
df1 <- data.frame(temp = rnorm(1e+7),
                  score = sample(letters, 1e+7, replace = TRUE))

library(microbenchmark)
microbenchmark(
  head(with(df1, df1[order(temp), 1]), 4),
  head(with(df1, df1[sort.list(temp), 1]), 4),
  head(df1[order(df1$temp), 1], 4),
  head(df1[sort.list(df1$temp), 1], 4),
  times = 1L
  )

Unit: seconds
                                        expr      min       lq   median       uq      max neval
     head(with(df1, df1[order(temp), 1]), 4) 13.42581 13.42581 13.42581 13.42581 13.42581     1
 head(with(df1, df1[sort.list(temp), 1]), 4) 13.80256 13.80256 13.80256 13.80256 13.80256     1
            head(df1[order(df1$temp), 1], 4) 13.88580 13.88580 13.88580 13.88580 13.88580     1
        head(df1[sort.list(df1$temp), 1], 4) 13.13579 13.13579 13.13579 13.13579 13.13579     1
Paulo E. Cardoso answered Jan 29 '26 23:01


There are several problems, some of which have been discussed in the comments, but one big one that I have not seen mentioned yet is that apply works on matrices, so it converts your data frame to a matrix before doing anything else. Because your data has both a factor and a numeric variable, the numbers are converted to character strings, and the sorting is done on the string representation rather than the numeric value. Using tools that work directly with data frames (and lists), such as order, and avoiding apply altogether will prevent this.
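
The coercion described above can be seen directly. The sketch below assumes a data frame shaped like the one in the question; temp.df is rebuilt here by hand, and the names m, col_classes, idx and topfour are only illustrative.

```r
# Hypothetical reconstruction of temp.df from the question.
temp.df <- data.frame(
  score = factor(c("a.score", "b.score", "c.score", "d.score", "e.score")),
  temp  = c(0.05502011, 0.02484594, -0.07183767, -0.06932274, -0.15512460)
)

# apply() converts the data frame with as.matrix() first; a matrix holds a
# single type, so the factor and numeric columns both end up as character.
m <- as.matrix(temp.df)
typeof(m)

# sapply() visits the columns one at a time, so each keeps its own class.
col_classes <- sapply(temp.df, class)
col_classes

# order() on the numeric column returns row indices, which keeps each name
# attached to its value without any coercion.
idx <- head(order(temp.df$temp), 4)
topfour <- temp.df[idx, ]
topfour
```

Indexing the data frame with the result of order() returns names and values in one step, so there is no need to build them separately and rbind them afterwards.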

Also, if you only want the n largest or smallest values then you may be able to speed things up a little by using sort.list instead of order and specifying the partial argument.
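
As far as I can tell from the R documentation, non-NULL values of partial are marked as not implemented for sort.list, so the sketch below swaps in sort() with its partial argument instead. That is an assumption about what is usable in practice, not necessarily what was meant above. The vector x, the seed and the variable names are all made up for illustration.

```r
set.seed(1)        # arbitrary seed so the sketch is repeatable
x <- rnorm(1e5)    # simulated values standing in for the temp column
n <- 4

# partial = 1:n asks sort() to put the elements at positions 1..n into their
# final sorted positions; the rest of the vector is left in no useful order.
smallest_partial <- sort(x, partial = seq_len(n))[seq_len(n)]

# Full sort for comparison.
smallest_full <- sort(x)[seq_len(n)]

identical(smallest_partial, smallest_full)
```

Note that partial sorting returns values only: names are dropped, and there is no index vector to map the values back to rows, so order() remains the simpler choice when the labels are needed as well.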

Greg Snow answered Jan 29 '26 23:01


