Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R dplyr column sort with alphanumeric characters

Tags:

r

dplyr

Why does dplyr not sort the first column properly when it's composed of numeric and alphabetic characters?

> library(dplyr)
> y <- read.table("file.csv", sep = ",")
> arrange(y, V1)
   V1         V2         V3         V4         V5         V6
1   1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2  10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
3  11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
4  12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
5  13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
6  14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
7  15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
8   2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
9   3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
10  4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
11  5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
12  6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
13  7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
14  8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
15  9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
16  A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17  B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18  C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310

I'd like to see:

   V1         V2         V3         V4         V5         V6
1   1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2   2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
3   3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
4   4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
5   5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
6   6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
7   7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
8   8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
9   9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
10 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
11 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
12 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
13 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
14 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
15 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
16  A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17  B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18  C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
like image 598
warship Avatar asked Mar 12 '23 10:03

warship


2 Answers

Your disregard of alpha is a bit problematic, but how about:

library(dplyr)
arrange(y, as.numeric(V1))
# Warning in order(as.numeric(y$V1)) : NAs introduced by coercion
#    V1         V2         V3         V4         V5         V6
# 1   1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
# 8   2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
# 9   3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
# 10  4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
# 11  5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
# 12  6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
# 13  7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
# 14  8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
# 15  9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
# 2  10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
# 3  11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
# 4  12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
# 5  13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
# 6  14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
# 7  15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
# 16  A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
# 17  B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
# 18  C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310

This also works with base:

y[ order(as.numeric(y$V1)), ]

Edit: OP then asked (deSpite! having said "I don't really care" ;-) how to then sort the non-numeric fields.

The reason the first command works is that the non-numeric fields are all converted to NA, which conveniently puts them after numbers in a sort. Well, both dplyr::arrange and base::order take arbitrary arguments, where ties in the first column are handled by the second argument, etc. So, in order to sort among the NAs (non-numeric V1 elements), just add something that makes sense amongst them, such as ... "them":

arrange(y, as.numeric(V1), V1)
y[ order(as.numeric(y$V1), y$V1), ]
like image 146
r2evans Avatar answered Mar 24 '23 21:03

r2evans


I recommend making V1 a factor and sorting the levels with a stringr package function before arranging:

> library(dplyr)
> library(stringr)

> y <- tibble(V1 = c("B", "A", 2, 1), V2 =c(1,2,3,4), V3=c(1,2,3,4))
> y %>%  
     dplyr::mutate(V1_fac = factor(V1, levels= str_sort(V1, numeric=TRUE))) %>%
     dplyr::arrange(V1_fac)

The numeric=TRUE option allows to sort V1 digits numerically, instead of as strings. If some entries in V1 are not unique, you might want to use:

y %>%  
     dplyr::mutate(V1_fac = factor(V1, levels= str_sort(unique(V1), numeric=TRUE))) %>%
     dplyr::arrange(V1_fac)
like image 42
Dominik Szabo Avatar answered Mar 24 '23 21:03

Dominik Szabo