I'm trying to find any variables in my data that have zero variance (i.e. constant continuous variables). I figured out how to do it with lapply, but I would like to use dplyr because I'm trying to follow tidy data principles. I can create a vector of just the variances using dplyr, but it's the last step, where I find the values equal to zero and return the variable names, that is confusing me.
Here's the code:
library(PReMiuM)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.7.2 ✔ stringr 1.2.0
#> ✔ readr 1.2.0 ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
setwd("~/Stapleton_Lab/Projects/Premium/hybridAnalysis/")
# read in data from analysis script
df <- read_csv("./hybrid.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> Exp = col_character(),
#> Pedi = col_character(),
#> Harvest = col_character()
#> )
#> See spec(...) for full column specifications.
# checking for missing variable
# df %>%
# select_if(function(x) any(is.na(x))) %>%
# summarise_all(funs(sum(is.na(.))))
# grab month for analysis
may <- df %>%
filter(Month==5)
june <- df %>%
filter(Month==6)
july <- df %>%
filter(Month==7)
aug <- df %>%
filter(Month==8)
sept <- df %>%
filter(Month==9)
oct <- df %>%
filter(Month==10)
# check for zero variance in continuous covariates
numericVars <- grep("Min|Max",names(june))
zero <- which(lapply(june[numericVars],var)==0,useNames = TRUE)
noVar <- june %>%
select(numericVars) %>%
summarise_all(var) %>%
filter_if(all, all_vars(. != 0))
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> (the same warning is repeated 19 more times)
With a reproducible example, I think what you are aiming for is below. Please note that as pointed out by Colin, I have not dealt with the issue of you selecting variables with a character variable. See his answer for details on that.
# reproducible data
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7
library(dplyr)
mtcars2 %>%
summarise_all(var) %>%
select_if(function(.) . == 0) %>%
names()
# [1] "mpg" "qsec"
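The scoped verbs used above (summarise_all(), select_if()) were superseded in dplyr 1.0.0. As a rough sketch, assuming dplyr 1.0.0 or later is installed, the same result can be obtained with select() and where(). The name zero_var is only illustrative.

```r
library(dplyr)

# Same reproducible data as above
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# Keep only numeric columns whose variance is exactly 0, then take their names.
# isTRUE() guards against an NA variance (e.g. a column containing NAs),
# since where() needs the predicate to return a single TRUE or FALSE.
zero_var <- mtcars2 %>%
  select(where(~ is.numeric(.x) && isTRUE(var(.x) == 0))) %>%
  names()

zero_var
#> [1] "mpg"  "qsec"
```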
Personally, I think that obfuscates what you are doing. One of the following, using the purrr package (if you wish to remain in the tidyverse), would be my preference, with a well-written comment.
library(purrr)
# Return a character vector of variable names which have 0 variance
names(mtcars2)[which(map_dbl(mtcars2, var) == 0)]
names(mtcars2)[map_lgl(mtcars2, function(x) var(x) == 0)]
If you'd like to optimize it for speed, stick with base R:
# Return a character vector of variable names which have 0 variance
names(mtcars2)[vapply(mtcars2, function(x) var(x) == 0, logical(1))]
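One caveat worth hedging: comparing a floating-point variance to exactly 0 can be fragile, and var() returns NA when a column contains missing values. A possible alternative, sketched here under the assumption that "constant" means "at most one distinct non-missing value", avoids computing the variance at all. The helper is_constant() is hypothetical and not part of any package.

```r
# Hypothetical helper: TRUE when a column has at most one distinct
# non-missing value. Note that an all-NA column also counts as constant here.
is_constant <- function(x) {
  x <- x[!is.na(x)]
  length(unique(x)) <= 1
}

# Same reproducible data as above
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

# Return a character vector of variable names which are constant
constant_vars <- names(mtcars2)[vapply(mtcars2, is_constant, logical(1))]

constant_vars
#> [1] "mpg"  "qsec"
```

This also works for character or factor columns, which var() would not handle sensibly.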
You have two problems.
select()
The vignette about that is here: Programming with dplyr. The solution is to use select_at(), the scoped variant of the select function.
noVar <- june %>%
select_at(.vars=numericVars) %>%
summarise_all(.funs=var) %>%
filter_all(any_vars(. == 0))
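Note that filter_all() keeps or drops the single summary row; it does not return column names. To get the names of the zero-variance columns with the same scoped verbs, one option is to finish with select_if() and names(). This is a minimal sketch on made-up data: june_toy and its column names are hypothetical stand-ins for the real june data, which is not available here.

```r
library(dplyr)

# Hypothetical stand-in for `june`: one character column, three numeric ones
june_toy <- data.frame(
  Pedi    = c("a", "b", "c"),
  TempMin = c(10, 10, 10),   # constant, so variance is 0
  TempMax = c(20, 25, 30),
  RainMin = c(0, 1, 2),
  stringsAsFactors = FALSE
)

# Integer positions of the columns whose names contain "Min" or "Max"
numericVars <- grep("Min|Max", names(june_toy))

zeroVarNames <- june_toy %>%
  select_at(.vars = numericVars) %>%    # select by position
  summarise_all(.funs = var) %>%        # one row holding each column's variance
  select_if(function(x) x == 0) %>%     # keep columns whose variance is 0
  names()

zeroVarNames
#> [1] "TempMin"
```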