In a boxplot I've set the option outline=FALSE to remove the outliers. Now I'd like to include points that show the mean in the boxplot. Obviously, the means calculated using mean() include the outliers.
How can the very same outliers be removed from a dataframe so that the calculated mean corresponds to the data shown in the boxplot?
I know how outliers can be removed, but which settings are used by the outline option of boxplot internally? Unfortunately, the manual does not give any clarification.
In ggplot2, outliers can be hidden by setting the outlier.shape argument of geom_boxplot() to NA. This only stops the points from being drawn: the data are unchanged and the y-axis is not automatically rescaled. coord_cartesian() can then be used to zoom the y-axis to the range of interest (for example, the whisker ends) without discarding any observations.
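A minimal sketch of that ggplot2 approach, assuming the ggplot2 package is installed. The data frame df, its columns group and value, and the two planted extreme values are made up for illustration, and taking the zoom limits from the whisker ends reported by base boxplot() is just one possible choice.

```r
library(ggplot2)

# Hypothetical data: two groups, each with one planted extreme value
set.seed(7)
df <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = c(rnorm(50, mean = 5), rnorm(50, mean = 8))
)
df$value[c(1, 51)] <- c(30, -20)

# Whisker ends per group: rows 1 and 5 of $stats (computed without plotting)
st <- boxplot(value ~ group, data = df, plot = FALSE)$stats
ylim <- range(st[c(1, 5), ])

p <- ggplot(df, aes(x = group, y = value)) +
  geom_boxplot(outlier.shape = NA) +   # hide the outlier points only
  coord_cartesian(ylim = ylim)         # zoom; no data are dropped
print(p)
```

Because coord_cartesian() only zooms, the box and whiskers are still computed from all the data, which is usually what you want here.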
A different rule is based on standard deviations: calculate the mean and standard deviation of the sample, then set the cut-offs for outliers at 3 standard deviations below and above the mean. Observations that fall outside these lower and upper limits are treated as outliers. Note that this is not the rule boxplot uses.
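The 3-standard-deviation rule could be sketched in base R as follows. The vector x and its two planted extreme values are invented for illustration.

```r
# Hypothetical sample: 200 normal draws plus two planted extreme values
set.seed(1)
x <- c(rnorm(200, mean = 10, sd = 2), 40, -25)

m <- mean(x)
s <- sd(x)
lower <- m - 3 * s   # lower cut-off
upper <- m + 3 * s   # upper cut-off

is_outlier <- x < lower | x > upper
x_clean <- x[!is_outlier]

mean(x)         # mean including the extreme values
mean(x_clean)   # mean after dropping values beyond 3 SD
```

Keep in mind that the mean and standard deviation used for the cut-offs are themselves affected by the extreme values, so this rule can miss outliers in small samples.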
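To answer the second part of your question, about how the outliers are chosen, it helps to recall how the boxplot is constructed: the box runs from the lower to the upper hinge (roughly the first and third quartiles), and each whisker extends to the most extreme data point that lies within 1.5 times the interquartile range (IQR) of the box. The factor 1.5 is the default of the range argument of boxplot. Points beyond the whiskers are the ones drawn as outliers, and these are what outline=FALSE hides.
If you assume that your data are normally distributed, the proportion of data beyond each whisker is: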
1-pnorm(qnorm(0.75)+1.5*2*qnorm(0.75))
which is about 0.0035. Therefore, a normally distributed variable has about 0.7% "boxplot outliers" in total, counting both whiskers.
However, this is not a very "reliable" way to detect outliers; there are packages specifically designed for this.
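Coming back to the first part of the question, here is a sketch of how the same points could be dropped before computing the means. boxplot() computes its statistics with boxplot.stats(), whose coef argument corresponds to boxplot's range (1.5 by default). The data frame df, its columns, and the planted extreme values are hypothetical.

```r
# Hypothetical data: two groups, each with one planted extreme value
set.seed(42)
df <- data.frame(
  group = rep(c("a", "b"), each = 50),
  value = c(rnorm(50, mean = 5), rnorm(50, mean = 8))
)
df$value[c(1, 51)] <- c(30, -20)

# Flag, within each group, the values boxplot() would draw as outliers
is_out <- ave(df$value, df$group,
              FUN = function(v) v %in% boxplot.stats(v, coef = 1.5)$out) == 1
df_clean <- df[!is_out, ]

# Group means of the data that remain after dropping those points
means <- tapply(df_clean$value, df_clean$group, mean)

# Boxplot without outlier points, with the matching means overlaid
bp <- boxplot(value ~ group, data = df, outline = FALSE)
points(seq_along(means), means, pch = 19, col = "red")
```

The value returned by boxplot() also carries the flagged points in bp$out (with their group index in bp$group), even when outline=FALSE, so it can be used to cross-check which observations were hidden.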