Unique Challenge Replacing Soft-deprecated funs()

Tags: r

The Problem:

I have a data frame consisting purely of numeric variables. I have a routine that has worked well in the past: it checks each variable in the data frame for statistical outliers and replaces any identified outliers with NA values. However, this routine uses the recently soft-deprecated funs().

Having researched this issue, I know that you're supposed to be able to replace funs() with list(~ example_func()). For example:

>funs(mean(., trim = .2), median(., na.rm = TRUE))
>
>Would become:
>
>list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))

Unfortunately, this remedy is not working in my use case.

The Functioning, But Now Soft-Deprecated Code:

The following code works, as seen below (for variables with outliers, the outliers ARE replaced with NA values); however, it triggers a warning with regard to the now soft-deprecated funs():

> # Which variables have missing values
> sapply(training_imptd, function(x) sum(is.na(x)))
           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
               0                0                0                0                0 
 TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
               0                0              102              131              772 
TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
            2085                0                0                0              102 
 TEAM_FIELDING_E TEAM_FIELDING_DP 
               0              286 
> 
> # Identify outliers and set them to NA (NAs to be fixed in next step by mice)
> training_imptd <- training_imptd %>%
+   mutate_all(
+     funs(ifelse(. %in% boxplot.stats(training_imptd$.)$out, NA, .))
+   )
>
> Warning: funs() is soft deprecated as of dplyr 0.8.0
> Please use a list of either functions or lambdas: 
> 
>   # Simple named list: 
>   list(mean = mean, median = median)
> 
>   # Auto named with `tibble::lst()`: 
>   tibble::lst(mean, median)
> 
>   # Using lambdas
>   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
> This warning is displayed once per session. 
>
> # Which variables have missing values (after imputing NA for outliers)
> sapply(training_imptd, function(x) sum(is.na(x)))
           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
               0               32               67               15               29 
 TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
               0              129              102              252              827 
TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
            2086              213                4               90              140 
 TEAM_FIELDING_E TEAM_FIELDING_DP 
             303              318 

The Remediated Code That Should Work, But Doesn't:

Based on what I've read about replacing funs() with list(~ example_func()), I would expect the following code to perform exactly as the code above that leverages funs(), but it doesn't (for variables with outliers, the outliers are NOT replaced with NA values):

> # Which variables have missing values
> sapply(training_imptd, function(x) sum(is.na(x)))
           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
               0                0                0                0                0 
 TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
               0                0              102              131              772 
TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
            2085                0                0                0              102 
 TEAM_FIELDING_E TEAM_FIELDING_DP 
               0              286 
> 
> # Identify outliers and set them to NA (NAs to be fixed in next step by mice)
> training_imptd <- training_imptd %>%
+   mutate_all(
+     list(~ ifelse(. %in% boxplot.stats(training_imptd$.)$out, NA, .))
+   )
> 
> # Which variables have missing values (after imputing NA for outliers)
> sapply(training_imptd, function(x) sum(is.na(x)))
           INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
               0                0                0                0                0 
 TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
               0                0              102              131              772 
TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
            2085                0                0                0              102 
 TEAM_FIELDING_E TEAM_FIELDING_DP 
               0              286 
asked Nov 06 '22 13:11 by matteblack
1 Answer

Remove the unnecessary training_imptd$ from the inside of your function. The pronoun . already refers to "the current column", so you can pass it to boxplot.stats() directly:

training_imptd %>%
  mutate_all(
    ~ifelse(. %in% boxplot.stats(.)$out, NA, .)
  )
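
A likely explanation for the difference, sketched below on made-up data: funs() appears to have substituted . with the name of the current column, so training_imptd$. effectively became something like training_imptd$TARGET_WINS. Inside a ~ lambda, . is just the function's argument, so training_imptd$. looks for a column literally named "." and (for a plain data frame with no such column) returns NULL, which means nothing is ever flagged as an outlier. The toy data frame and its columns a and b are hypothetical and exist only for illustration. The last step assumes dplyr 1.0.0 or later, where across() supersedes mutate_all().

```r
library(dplyr)

# Hypothetical toy data: each column contains one obvious outlier
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 100),
  b = c(10, 11, 12, 13, 14, -50)
)

# `toy$.` looks up a column literally named "." and finds none,
# so a lambda that calls boxplot.stats(toy$.) never sees real data
is.null(toy$.)

# The fix: pass the current column (`.`) straight to boxplot.stats()
cleaned <- toy %>%
  mutate_all(~ ifelse(. %in% boxplot.stats(.)$out, NA, .))

# Same idea written with across() (assumes dplyr >= 1.0.0)
cleaned_across <- toy %>%
  mutate(across(everything(), ~ ifelse(.x %in% boxplot.stats(.x)$out, NA, .x)))

# Count the NA values introduced in each column
colSums(is.na(cleaned))
```

On your own data, comparing colSums(is.na(...)) before and after the call is a quick way to confirm that outliers are now being replaced.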
answered Nov 15 '22 06:11 by Artem Sokolov