Why does dplyr
not sort the first column properly when it's composed of numeric and alphabetic characters?
> library(dplyr)
> y <- read.table("file.csv", sep = ",")
> arrange(y, V1)
V1 V2 V3 V4 V5 V6
1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
3 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
4 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
5 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
6 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
7 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
8 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
9 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
10 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
11 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
12 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
13 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
14 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
15 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
I'd like to see:
V1 V2 V3 V4 V5 V6
1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
2 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
3 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
4 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
5 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
6 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
7 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
8 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
9 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
10 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
11 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
12 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
13 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
14 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
15 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
Your disregard of alpha is a bit problematic, but how about:
library(dplyr)
arrange(y, as.numeric(V1))
# Warning in order(as.numeric(y$V1)) : NAs introduced by coercion
# V1 V2 V3 V4 V5 V6
# 1 1 0.97348999 0.11047091 0.95841014 0.61826620 0.43164420
# 8 2 0.22754619 0.16812359 0.39432991 0.68562992 0.43066861
# 9 3 0.33318220 0.21108688 0.60911213 0.64475379 0.98617404
# 10 4 0.57208511 0.58709229 0.29435093 0.78603855 0.81185551
# 11 5 0.35548490 0.15229426 0.42423263 0.72963238 0.04401239
# 12 6 0.08575802 0.33310521 0.09671737 0.90820671 0.33289880
# 13 7 0.05743798 0.20439928 0.56411860 0.54859270 0.81053637
# 14 8 0.99056584 0.29960046 0.20765701 0.45722997 0.51354034
# 15 9 0.35839568 0.11667019 0.56498996 0.43971051 0.23968955
# 2 10 0.82178167 0.21619067 0.11993356 0.06335101 0.28703842
# 3 11 0.35952632 0.27595845 0.24760335 0.63887200 0.47491472
# 4 12 0.43775624 0.08852486 0.06870304 0.63670202 0.55432641
# 5 13 0.83894086 0.40484966 0.96735507 0.86764578 0.02588688
# 6 14 0.95258399 0.65029909 0.97183605 0.87688243 0.97729517
# 7 15 0.62839615 0.52999000 0.05722874 0.40709867 0.56039580
# 16 A 0.25645249 0.07045102 0.17046681 0.75700118 0.50269449
# 17 B 0.57722865 0.31544398 0.33129932 0.44173772 0.11600295
# 18 C 0.94242373 0.55745376 0.01542128 0.01723924 0.11413310
This also works with base:
y[ order(as.numeric(y$V1)), ]
Edit: OP then asked (deSpite! having said "I don't really care" ;-) how to then sort the non-numeric fields.
The reason the first command works is that the non-numeric fields are all converted to NA
, which conveniently puts them after numbers in a sort. Well, both dplyr::arrange
and base::order
take arbitrary arguments, where ties in the first column are handled by the second argument, etc. So, in order to sort among the NA
s (non-numeric V1
elements), just add something that makes sense amongst them, such as ... "them":
arrange(y, as.numeric(V1), V1)
y[ order(as.numeric(y$V1), y$V1), ]
I recommend making V1 a factor and sorting the levels with a stringr package function before arranging:
> library(dplyr)
> library(stringr)
> y <- tibble(V1 = c("B", "A", 2, 1), V2 =c(1,2,3,4), V3=c(1,2,3,4))
> y %>%
dplyr::mutate(V1_fac = factor(V1, levels= str_sort(V1, numeric=TRUE))) %>%
dplyr::arrange(V1_fac)
The numeric=TRUE option allows to sort V1 digits numerically, instead of as strings. If some entries in V1 are not unique, you might want to use:
y %>%
dplyr::mutate(V1_fac = factor(V1, levels= str_sort(unique(V1), numeric=TRUE))) %>%
dplyr::arrange(V1_fac)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With