 

Explain ggplot2 warning: "Removed k rows containing missing values"

Tags: r, ggplot2

I get this warning when I am trying to generate a plot with ggplot.

After researching online for a while, I found many people suggesting that my dataset contains either null values or missing data in general, which was not the case.

In this question the accepted answer says the following:

The warning means that some elements are removed because they fall out of the specified range

I was wondering what exactly this range refers to, and how one can manually increase it to avoid the warnings?

ksm001 asked Sep 10 '15 14:09


2 Answers

The behavior you're seeing is due to how ggplot2 deals with data that are outside the axis ranges of the plot. You can change this behavior depending on whether you use scale_y_continuous (or, equivalently, ylim) or coord_cartesian to set axis ranges, as explained below.

    library(ggplot2)

    # All points are visible in the plot
    ggplot(mtcars, aes(mpg, hp)) +
      geom_point()

In the code below, one point with hp = 335 is outside the y-range of the plot. Because we used scale_y_continuous to set the y-axis range, ggplot2 treats this out-of-range value as missing (hence the wording of the warning), and the point is not included in any statistics or summary measures calculated by ggplot, such as the linear regression line.

    ggplot(mtcars, aes(mpg, hp)) +
      geom_point() +
      scale_y_continuous(limits=c(0,300)) +  # Change this to limits=c(0,335) and the warning disappears
      geom_smooth(method="lm")

    Warning messages:
    1: Removed 1 rows containing missing values (stat_smooth).
    2: Removed 1 rows containing missing values (geom_point).
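To see which range the warning refers to, it can help to count the out-of-range rows yourself before plotting. The sketch below uses only base R; count_out_of_range is a hypothetical helper written for this illustration (it is not a ggplot2 function), and it assumes the limits are given as c(lower, upper).

```r
# Hypothetical helper (not part of ggplot2): count the values that fall
# outside limits given as c(lower, upper).
count_out_of_range <- function(x, limits) {
  sum(x < limits[1] | x > limits[2], na.rm = TRUE)
}

# Rows that limits = c(0, 300) on the y scale would turn into missing values
count_out_of_range(mtcars$hp, c(0, 300))  # 1

# The data range; limits that cover it leave nothing to remove
range(mtcars$hp)                          # 52 335
count_out_of_range(mtcars$hp, c(0, 335))  # 0
```

The count for c(0, 300) matches the k = 1 reported in the warning, and widening the limits to cover range(mtcars$hp) brings it to zero.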

In the code below, the point with hp = 335 is still outside the y-range of the plot, but this point is nevertheless included in any statistics or summary measures that ggplot calculates, such as the linear regression line. This is because we used coord_cartesian to set the y-axis range, and this function does not exclude points that are outside the plot ranges when it does other calculations on the data.

If you compare this and the previous plot, you can see that the linear regression line in the second plot has a slightly steeper slope, because the point with hp=335 is included when calculating the regression line, even though it's not visible in the plot.

    ggplot(mtcars, aes(mpg, hp)) +
      geom_point() +
      coord_cartesian(ylim=c(0,300)) +
      geom_smooth(method="lm")
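The difference between the two regression lines can also be checked outside ggplot2. This is a rough sketch that assumes geom_smooth(method="lm") fits hp ~ mpg, and that setting scale limits of c(0, 300) is equivalent to dropping the rows with hp above 300 before fitting. The names fit_all and fit_censored are made up for this example.

```r
# Fit on all rows: what the coord_cartesian() version would use
fit_all <- lm(hp ~ mpg, data = mtcars)

# Fit without the out-of-range row: what the scale_y_continuous() version would use
fit_censored <- lm(hp ~ mpg, data = subset(mtcars, hp <= 300))

# Compare the slopes; both are negative, so compare their magnitudes
slope_all <- coef(fit_all)[["mpg"]]
slope_censored <- coef(fit_censored)[["mpg"]]
abs(slope_all) > abs(slope_censored)  # TRUE: the fit on all rows is steeper
```

The row with hp = 335 sits at a low mpg value and well above the fitted line, so including it pulls the line toward a steeper slope.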
eipi10 answered Sep 21 '22 18:09


Just for the sake of completeness, here is an addition to the answer given by eipi10.

I was facing the same problem without using either scale_y_continuous or coord_cartesian.

The conflict was coming from the x axis, where I defined limits = c(1, 30). It seems such limits do not leave enough space if you want to "dodge" your bars, so R still throws the warning:

Removed 8 rows containing missing values (geom_bar)

Adjusting the limits of the x axis to limits = c(0, 31) solved the problem.
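One way to reason about why c(1, 30) is too tight: a bar is drawn around its x position, so its edges extend beyond the position itself. The sketch below is base R only and assumes bars centred on the integers 1 to 30 with ggplot2's default bar width of 0.9. The data and the helper bars_cut_off are hypothetical, and the original plot may have used different positions or widths.

```r
# Hypothetical data: bars centred on the integers 1..30
x <- 1:30
width <- 0.9  # assumed: ggplot2's default bar width (90% of the data resolution)

# Hypothetical helper: return the bar positions whose edges cross the x limits
bars_cut_off <- function(x, width, limits) {
  xmin <- x - width / 2
  xmax <- x + width / 2
  x[xmin < limits[1] | xmax > limits[2]]
}

bars_cut_off(x, width, c(1, 30))  # 1 30: the outermost bars stick out
bars_cut_off(x, width, c(0, 31))  # integer(0): every bar fits
```

With dodged bars, each x position holds several bars, so a single position that crosses the limit can account for more than one removed row.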

In conclusion, even if you are not setting limits on your y axis, check the limits on your x axis to make sure the bars have enough space.

davidnortes answered Sep 21 '22 18:09