I am attempting to work through Hadley Wickham's R for Data Science and have gotten tripped up on the following question: "How could you use arrange() to sort all missing values to the start? (Hint: use is.na())" I am using the flights dataset included in the nycflights13 package. Given that arrange() sorts all unknown values to the bottom of the dataframe, I am not sure how one would do the opposite across the missing values of all variables. I realize that this question can be answered with base R code, but I am specifically interested in how this would be done using dplyr and a call to the arrange() and is.na() functions. Thanks.
The arrange() function puts NA values last, and wrapping the column in desc() does not change that. To put NA values first, sort first on an indicator of whether the column is missing, i.e. is.na() of that column.
The arrange() function lets you reorder the rows of a tibble. It takes a tibble, followed by the unquoted names of the columns to sort by. For example, to sort in ascending order of column x, then (where there is a tie in x) in descending order of y, you would write arrange(df, x, desc(y)), where df is the tibble.
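To see the tie-breaking on concrete data, here is a minimal sketch. The tibble df and its values are made up for illustration; only arrange() and desc() come from dplyr.

```r
library(dplyr)

# Hypothetical toy tibble: x has a tie (two 1s) that y has to break.
df <- tibble(x = c(2, 1, 1), y = c(10, 20, 30))

# Ascending x, then descending y within tied values of x.
sorted <- arrange(df, x, desc(y))
sorted
```

The two rows with x equal to 1 come first, with the larger y ahead of the smaller one.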
To sort the rows of a data frame in ascending order (from low to high), use arrange() from the dplyr package. To sort in descending order (from high to low), combine arrange() with desc(), also from dplyr.
The easiest way is to follow the hint directly:
arrange(flights, desc(is.na(dep_time)))
Two shorter equivalents:
arrange(flights, !is.na(dep_time))
or
arrange(flights, -is.na(dep_time))
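All three calls work for the same reason: is.na() returns a logical vector, FALSE sorts before TRUE, and desc(), ! or a unary minus reverses that so the TRUE (missing) rows come first. A minimal sketch on made-up data (the tibble df and its id column are hypothetical, not part of flights):

```r
library(dplyr)

# Hypothetical toy data: x has two missing values.
df <- tibble(id = 1:5, x = c(3, NA, 1, NA, 2))

# Default behaviour: rows with NA in x end up last.
na_last <- arrange(df, x)

# Sort on the missingness indicator first (TRUE before FALSE via desc()),
# then on x itself so the non-missing rows are still ordered.
na_first <- arrange(df, desc(is.na(x)), x)

na_last$id   # 3 5 1 2 4
na_first$id  # 2 4 3 5 1
```

Adding x as a second sort key is optional; without it the non-missing rows simply keep their original order.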
We can wrap is.na() in desc() to get the missing values at the start:
flights %>%
  arrange(desc(is.na(dep_time)),
          desc(is.na(dep_delay)),
          desc(is.na(arr_time)),
          desc(is.na(arr_delay)),
          desc(is.na(tailnum)),
          desc(is.na(air_time)))
NA values occur only in these variables, as the following check shows:
names(flights)[colSums(is.na(flights)) >0]
#[1] "dep_time" "dep_delay" "arr_time" "arr_delay" "tailnum" "air_time"
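Instead of passing each variable name one at a time, we can also use the standard-evaluation variant arrange_(), which accepts the sort expressions as strings (note that arrange_() is deprecated as of dplyr 0.7.0):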
nm1 <- paste0("desc(is.na(", names(flights)[colSums(is.na(flights)) >0], "))")
r1 <- flights %>%
  arrange_(.dots = nm1)
r1 %>%
  head()
#year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
#1 2013 1 2 NA 1545 NA NA 1910 NA AA 133 <NA>
#2 2013 1 2 NA 1601 NA NA 1735 NA UA 623 <NA>
#3 2013 1 3 NA 857 NA NA 1209 NA UA 714 <NA>
#4 2013 1 3 NA 645 NA NA 952 NA UA 719 <NA>
#5 2013 1 4 NA 845 NA NA 1015 NA 9E 3405 <NA>
#6 2013 1 4 NA 1830 NA NA 2044 NA 9E 3716 <NA>
#Variables not shown: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <time>.
With newer versions of the tidyverse (dplyr_0.7.3, rlang_0.1.2), we can also make use of arrange_at, arrange_all, and arrange_if:
nm1 <- names(flights)[colSums(is.na(flights)) >0]
r2 <- flights %>%
  arrange_at(vars(nm1), funs(desc(is.na(.))))
Or use arrange_if
f <- rlang::as_function(~ any(is.na(.)))
r3 <- flights %>%
  arrange_if(f, funs(desc(is.na(.))))
identical(r1, r2)
#[1] TRUE
identical(r1, r3)
#[1] TRUE
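In later dplyr releases funs() was deprecated (0.8.0) and the scoped verbs arrange_at(), arrange_all() and arrange_if() were superseded by across() (1.0.0). The sketch below shows one way the same idea might be written in that style. It assumes dplyr 1.0.0 or later and uses a small made-up tibble rather than flights; it is an illustration, not the method from the answer above.

```r
library(dplyr)

# Hypothetical toy data with NAs in two of the three columns.
df <- tibble(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# For every column that contains at least one NA, sort on
# desc(is.na(column)) so rows with missing values come first.
na_first <- df %>%
  arrange(across(where(~ any(is.na(.x))), ~ desc(is.na(.x))))

na_first$c  # 2 3 1
```

The columns are used as sort keys in their original order, so a row missing a comes ahead of a row missing only b, which matches the column-by-column behaviour of the arrange_if() version.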