
dplyr arrange() function sort by missing values

Tags: sorting, r, na, dplyr

I am attempting to work through Hadley Wickham's R for Data Science and have gotten tripped up on the following question: "How could you use arrange() to sort all missing values to the start? (Hint: use is.na())" I am using the flights dataset included in the nycflights13 package. Given that arrange() sorts all unknown values to the bottom of the dataframe, I am not sure how one would do the opposite across the missing values of all variables. I realize that this question can be answered with base R code, but I am specifically interested in how this would be done using dplyr and a call to the arrange() and is.na() functions. Thanks.

T. Gross asked Jun 11 '16 06:06




2 Answers

The simplest way is the one shown in the other answer: wrap is.na() in desc().

arrange(flights, desc(is.na(dep_time)))

Two shorter forms also put the missing values first:

arrange(flights, !is.na(dep_time))

or

arrange(flights, -is.na(dep_time))
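
All three forms rely on the same fact: arrange() sorts ascending, so the rows with the smallest key come first. Here is a minimal sketch on a made-up tibble (toy and its values are hypothetical, not taken from flights) that shows each form putting the NA rows first:

```r
library(dplyr)

# Hypothetical toy data; not part of nycflights13.
toy <- tibble(
  id       = 1:5,
  dep_time = c(517L, NA, 533L, NA, 542L)
)

# desc() reverses the order of the logical key, so TRUE (missing) sorts first.
by_desc <- arrange(toy, desc(is.na(dep_time)))

# !is.na() is FALSE for missing rows, and FALSE sorts before TRUE.
by_not <- arrange(toy, !is.na(dep_time))

# Unary minus coerces TRUE to -1 and FALSE to 0, and -1 sorts before 0.
by_neg <- arrange(toy, -is.na(dep_time))

# In each result the first two rows are the ones with a missing dep_time.
head(by_desc, 2)
```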
Arkadiusz Choczaj answered Sep 16 '22 18:09


We can wrap is.na() in desc() to sort the missing values to the start:

library(dplyr)
library(nycflights13)

flights %>%
    arrange(desc(is.na(dep_time)),
            desc(is.na(dep_delay)),
            desc(is.na(arr_time)),
            desc(is.na(arr_delay)),
            desc(is.na(tailnum)),
            desc(is.na(air_time)))

Only those variables contain NA values, which we can check with

names(flights)[colSums(is.na(flights)) > 0]
#[1] "dep_time"  "dep_delay" "arr_time"  "arr_delay" "tailnum"   "air_time" 

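Instead of passing each variable name one at a time, we can build the expressions as strings and pass them to arrange_, the standard-evaluation (SE) version of arrange (deprecated in later dplyr releases):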

nm1 <- paste0("desc(is.na(", names(flights)[colSums(is.na(flights)) >0], "))")

r1 <- flights %>%
        arrange_(.dots = nm1) 

r1 %>%
   head()
#year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
#  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>
#1  2013     1     2       NA           1545        NA       NA           1910        NA      AA    133    <NA>
#2  2013     1     2       NA           1601        NA       NA           1735        NA      UA    623    <NA>
#3  2013     1     3       NA            857        NA       NA           1209        NA      UA    714    <NA>
#4  2013     1     3       NA            645        NA       NA            952        NA      UA    719    <NA>
#5  2013     1     4       NA            845        NA       NA           1015        NA      9E   3405    <NA>
#6  2013     1     4       NA           1830        NA       NA           2044        NA      9E   3716    <NA>
#Variables not shown: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#  time_hour <time>.

Update

With newer versions of the tidyverse (dplyr_0.7.3, rlang_0.1.2), we can also use arrange_at, arrange_all, or arrange_if:

nm1 <- names(flights)[colSums(is.na(flights)) >0]
r2 <- flights %>% 
          arrange_at(vars(nm1), funs(desc(is.na(.))))

Or use arrange_if

f <- rlang::as_function(~ any(is.na(.)))
r3 <- flights %>% 
          arrange_if(f, funs(desc(is.na(.))))


identical(r1, r2)
#[1] TRUE

identical(r1, r3)
#[1] TRUE
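
In more recent dplyr releases (1.0.0 and later, as far as I know), the scoped verbs and funs() have been superseded by across(). Here is a hedged sketch of the same idea, run on a made-up tibble (toy and its values are hypothetical) rather than on flights:

```r
library(dplyr)

# Hypothetical toy data; not part of nycflights13.
toy <- tibble(
  id       = 1:4,
  dep_time = c(517L, NA, 533L, 542L),
  tailnum  = c("N14228", "N24211", NA, "N619AA")
)

# where(anyNA) selects only the columns that contain at least one NA;
# each selected column is then turned into a "missing first" sort key.
na_first <- toy %>%
  arrange(across(where(anyNA), ~ desc(is.na(.x))))

# Rows with a missing dep_time come first, then rows with a missing tailnum.
na_first
```

The same arrange() call should work with flights in place of toy.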
akrun answered Sep 18 '22 18:09