I have a data frame whose variables are all numeric. I have a routine that has worked well in the past: it checks each variable in the data frame for statistical outliers and replaces any it finds with NA. However, this routine uses the recently soft-deprecated funs().
Having researched this issue, I understand that you are supposed to be able to replace funs() with list(~ example_func()). For example:
>funs(mean(., trim = .2), median(., na.rm = TRUE))
>
>Would become:
>
>list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
Unfortunately, this remedy is not working in my use case.
The following code works, as seen below (for variables with outliers, the outliers ARE replaced with NA values); however, it triggers a warning with regard to the now soft-deprecated funs():
> # Which variables have missing values
> sapply(training_imptd, function(x) sum(is.na(x)))
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 0 0 0 0
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
0 0 102 131 772
TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
2085 0 0 0 102
TEAM_FIELDING_E TEAM_FIELDING_DP
0 286
>
> # Identify outliers and set them to NA (NAs to be fixed in next step by mice)
> training_imptd <- training_imptd %>%
+ mutate_all(
+ funs(ifelse(. %in% boxplot.stats(training_imptd$.)$out, NA, .))
+ )
>
> Warning: funs() is soft deprecated as of dplyr 0.8.0
> Please use a list of either functions or lambdas:
>
> # Simple named list:
> list(mean = mean, median = median)
>
> # Auto named with `tibble::lst()`:
> tibble::lst(mean, median)
>
> # Using lambdas
> list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
> This warning is displayed once per session.
>
> # Which variables have missing values (after imputing NA for outliers)
> sapply(training_imptd, function(x) sum(is.na(x)))
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 32 67 15 29
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
0 129 102 252 827
TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
2086 213 4 90 140
TEAM_FIELDING_E TEAM_FIELDING_DP
303 318
Based on what I've read about replacing funs() with list(~ example_func()), I would expect the following code to behave exactly like the funs() version above, but it doesn't: for variables with outliers, the outliers are NOT replaced with NA values.
> # Which variables have missing values
> sapply(training_imptd, function(x) sum(is.na(x)))
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 0 0 0 0
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
0 0 102 131 772
TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
2085 0 0 0 102
TEAM_FIELDING_E TEAM_FIELDING_DP
0 286
>
> # Identify outliers and set them to NA (NAs to be fixed in next step by mice)
> training_imptd <- training_imptd %>%
+ mutate_all(
+ list(~ ifelse(. %in% boxplot.stats(training_imptd$.)$out, NA, .))
+ )
>
> # Which variables have missing values (after imputing NA for outliers)
> sapply(training_imptd, function(x) sum(is.na(x)))
INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
0 0 0 0 0
TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
0 0 102 131 772
TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
2085 0 0 0 102
TEAM_FIELDING_E TEAM_FIELDING_DP
0 286
Remove the unnecessary training_imptd$ from inside your function. The pronoun . already refers to "the current column", so you can pass it to boxplot.stats() directly:
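library(dplyr)

# `.` is the current column, so boxplot.stats() sees that column's values
training_imptd <- training_imptd %>%
  mutate_all(
    ~ ifelse(. %in% boxplot.stats(.)$out, NA, .)
  )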
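To see the difference on something small, here is a minimal sketch using a made-up data frame (toy) and a hypothetical helper (outliers_to_na); neither comes from the question. It relies on the fact that $ does not evaluate its right-hand side, so toy$. looks for a column literally named "." and finds nothing. A plausible reason the original worked under funs(), not confirmed here, is that funs() rewrote each . into the column name before evaluating the expression. The across() line assumes dplyr 1.0.0 or later and is only an alternative spelling of the same idea.

```r
library(dplyr)

# Hypothetical toy data: each column contains one obvious outlier.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 100),
  b = c(10, 11, 12, 13, 14, -50)
)

# `$` takes its right-hand side as a literal name, so this asks for a
# column called "."; a plain data.frame has none and returns NULL.
is.null(toy$.)

# Hypothetical helper: replace boxplot outliers in one vector with NA.
outliers_to_na <- function(x) {
  ifelse(x %in% boxplot.stats(x)$out, NA, x)
}

# Lambda form from the answer: `.` is the current column's values.
cleaned <- toy %>%
  mutate_all(~ ifelse(. %in% boxplot.stats(.)$out, NA, .))

# Same operation with across(), assuming dplyr >= 1.0.0.
cleaned_across <- toy %>%
  mutate(across(everything(), outliers_to_na))

# Count the NA values introduced in each column.
colSums(is.na(cleaned))
```

Pulling the logic into a named function keeps the column-wise call short and makes it easy to check the outlier rule on a single vector before applying it to the whole data frame.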