
Excluding outliers from the regression line fitted through a scatterplot, without removing them from the plot

Tags: r, ggplot2

I have the following data, on which I run the ggplot2 code below:

data <- structure(list(country_mean_rep = structure(c(73.6995708154506, 
93.5501285347044, 85.1529051987768, 91.1017369727047, 79.5562130177515, 
84.6751054852321, 89.8, 86.8826405867971, 94.2247191011236, 70.2321428571429, 
88.4107142857143), label = "label", format.stata = "%9.2f"), 
    country_mean_crime = c(0.0944206008583691, 0.0565552699228792, 
    0.0336391437308868, 0.205955334987593, 0.130177514792899, 
    0.282700421940928, 0.220512820512821, 0.415647921760391, 
    0.387640449438202, 0.200892857142857, 0.292207792207792), 
    country_name = structure(c(1L, 2L, 3L, 4L, 5L, 7L, 11L, 12L, 
    14L, 16L, 20L), .Label = c("Albania", "Armenia", "Azerbaijan", 
    "Belarus", "Bosnia and Herzegovina", "Brazil", "Bulgaria", 
    "Cambodia", "Chile", "CostaRica", "Croatia", "Czech", "Ecuador", 
    "Estonia", "FYROM", "Georgia", "Germany", "Greece", "Guyana", 
    "Hungary", "Ireland", "Kazakhstan", "Kenya", "Kyrgyzstan", 
    "Latvia", "Lithuania", "Malawi", "Mali", "Moldova", "Philippines", 
    "Poland", "Portugal", "Romania", "Russia", "Senegal", "Serbia&Montenegro", 
    "Slovakia", "Slovenia", "South Africa", "South Korea", "Spain", 
    "SriLanka", "Tajikistan", "Turkey", "Ukraine", "Uzbekistan", 
    "Vietnam"), class = "factor")), row.names = c(NA, -11L), class = c("data.table", 
"data.frame"))

# I would like to run the following code on this data:

library(ggplot2)

ggplot(data, aes(x=country_mean_rep, y=country_mean_crime)) + 
  geom_point() + 
  geom_smooth(aes(colour="linear", fill="linear"), 
              method="lm", 
              formula=y ~ x) + 
  geom_smooth(aes(colour="quadratic", fill="quadratic"), 
              method="lm", 
              formula=y ~ x + I(x^2)) + 
  geom_smooth(aes(colour="cubic", fill="cubic"), 
              method="lm", 
              formula=y ~ x + I(x^2) + I(x^3)) + 
  labs(colour="Functional Form", fill="Functional Form") +
  geom_text(aes(label=country_name), nudge_y=0.02) +
  theme_bw()


Now suppose the Czech Republic is an outlier that I want to exclude from the fits I am doing (especially the linear one). I understand there is nothing wrong with the Czech Republic in this example; I need this for a genuine outlier in my actual data.

Is there some way of excluding it only from the fit, while keeping the dot in the plot?

Tom asked Aug 31 '25 10:08


1 Answer

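One way to do it is to give the layers different data: fit the smooths on a subset that excludes the outlier, and draw the points and labels from the full data: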

ggplot(subset(data, country_name != 'Czech'), aes(x=country_mean_rep, y=country_mean_crime)) + 
  geom_smooth(aes(colour="linear", fill="linear"), 
              method="lm", 
              formula=y ~ x) + 
  geom_smooth(aes(colour="quadratic", fill="quadratic"), 
              method="lm", 
              formula=y ~ x + I(x^2)) + 
  geom_smooth(aes(colour="cubic", fill="cubic"), 
              method="lm", 
              formula=y ~ x + I(x^2) + I(x^3)) + 
  labs(colour="Functional Form", fill="Functional Form") +
  geom_point(data = data, inherit.aes = FALSE, aes(x = country_mean_rep, y = country_mean_crime)) +
  geom_text(data = data, aes(label=country_name, x = country_mean_rep, y = country_mean_crime), inherit.aes = FALSE, nudge_y=0.02) +
  theme_bw()

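In this case, the three linear models use the subsetted data passed to ggplot(), whereas the calls to geom_point and geom_text are given the full data and, with inherit.aes = FALSE, supply their own aesthetics instead of inheriting the plot-level ones.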
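
The same idea can also be turned around: keep the full data in ggplot() and hand only the smooths a subset. The sketch below is a minimal variant under a few assumptions: the data frame is a rounded copy of the question's values, rebuilt as a plain data.frame so the example runs on its own, and fit_data is a hypothetical helper name that is not part of ggplot2. It relies on the layer-level data argument accepting a function, which ggplot2 calls with the plot data.

```r
library(ggplot2)

# Rounded copy of the question's data (assumption: rounding is only for
# brevity), rebuilt as a plain data.frame so the example is self-contained.
data <- data.frame(
  country_name = c("Albania", "Armenia", "Azerbaijan", "Belarus",
                   "Bosnia and Herzegovina", "Bulgaria", "Croatia", "Czech",
                   "Estonia", "Georgia", "Hungary"),
  country_mean_rep = c(73.70, 93.55, 85.15, 91.10, 79.56, 84.68,
                       89.80, 86.88, 94.22, 70.23, 88.41),
  country_mean_crime = c(0.094, 0.057, 0.034, 0.206, 0.130, 0.283,
                         0.221, 0.416, 0.388, 0.201, 0.292)
)

# Hypothetical helper: returns the rows the fits should see.
fit_data <- function(d) subset(d, country_name != "Czech")

p <- ggplot(data, aes(x = country_mean_rep, y = country_mean_crime)) +
  geom_point() +  # drawn from the full data, outlier included
  geom_smooth(data = fit_data, aes(colour = "linear", fill = "linear"),
              method = "lm", formula = y ~ x) +
  geom_smooth(data = fit_data, aes(colour = "quadratic", fill = "quadratic"),
              method = "lm", formula = y ~ x + I(x^2)) +
  geom_smooth(data = fit_data, aes(colour = "cubic", fill = "cubic"),
              method = "lm", formula = y ~ x + I(x^2) + I(x^3)) +
  labs(colour = "Functional Form", fill = "Functional Form") +
  geom_text(aes(label = country_name), nudge_y = 0.02) +  # full data again
  theme_bw()

print(p)

# Optional check: the linear smooth should correspond to lm() on the subset.
coef(lm(country_mean_crime ~ country_mean_rep, data = fit_data(data)))
```

With this layout geom_point and geom_text need neither data = data nor inherit.aes = FALSE, because the plot-level data is already the full set; only the three geom_smooth layers are restricted.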

Cole answered Sep 03 '25 01:09