
What exactly does regplot()'s robust option do?

Related to this question, I am wondering what the robust option in seaborn's regplot() actually does.

The description reads as follows:

If True, use statsmodels to estimate a robust regression. This will de-weight outliers. Note that this is substantially more computationally intensive than standard linear regression, so you may wish to decrease the number of bootstrap resamples (n_boot) or set ci to None.

Does that mean that it is more similar to how Kendall or Spearman correlations work, as they are known to be robust against outliers? Or doesn't it have anything to do with each other? In other words, when calculating Kendall for some data, and drawing a scatterplot with regplot(), does it make sense to use robust=True?

Bram Vanroy asked Jan 03 '23 11:01


1 Answer

Correlation coefficients vs. Regression Coefficients

Kendall and Spearman correlations measure how closely related two variables are. The result is a correlation coefficient, a statistic that tells you how strongly the variables are related (1 is a perfect positive relationship, 0 is no relationship, -1 is a perfect negative relationship), so its sign gives, in a crude sense, the direction of the relationship. Both are computed from ranks, which makes them less sensitive to outliers than Pearson's coefficient, but neither is immune to outliers, and Spearman is generally the more sensitive of the two.
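To get a feel for how much each coefficient reacts to outliers, here is a minimal sketch using scipy.stats. The fixed seed and the helper name correlations are my own assumptions for illustration; the data mimics the example used later in this answer, so the exact numbers will differ from the ones quoted there.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
x = np.arange(0, 10, 0.2)
y_clean = 0.25 * x + rng.normal(0, 0.1, x.size)
y_out = y_clean.copy()
y_out[[12, 14, 18, 24]] -= 4  # pull four points far below the line


def correlations(x, y):
    """Hypothetical helper: Pearson r, Spearman rho and Kendall tau as a dict."""
    return {
        "pearson": stats.pearsonr(x, y)[0],
        "spearman": stats.spearmanr(x, y)[0],
        "kendall": stats.kendalltau(x, y)[0],
    }


clean = correlations(x, y_clean)
contaminated = correlations(x, y_out)
for name in clean:
    print(f"{name:8s} clean={clean[name]:.2f}  with outliers={contaminated[name]:.2f}")
```

On data like this, all three coefficients drop once the outliers are added, with Pearson dropping the most.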

Robust linear regression, on the other hand, is a variant of linear regression, which is a means of finding the relationship between two or more variables. You can think of it as a method of finding the "line of best fit". The result of linear regression is a set of regression coefficients, which describe how (direction and slope) your response changes with your variables.

"Classical" vs. Robust linear regression

Often, linear regression uses Ordinary Least Squares (OLS) to find the regression coefficients, with the goal of minimizing the sum of squares of your residuals (the differences between your actual data and the values predicted by your estimated line). Because the residuals are squared, a few large ones dominate the sum, which makes OLS quite sensitive to outliers:

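import numpy as np
import seaborn as sns

x = np.arange(0, 10, 0.2)
y = (x * 0.25) + np.random.normal(0, 0.1, 50)  # slope 0.25 plus small noise
y[[12, 14, 18, 24]] -= 4  # pull four points far below the line (outliers)

# Ordinary least squares fit; x and y are passed as keywords
sns.regplot(x=x, y=y, robust=False)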

(Figure: scatterplot of the data with the OLS regression line.)

Notice how the line is dragged down by the outliers. In a lot of cases, this is the behaviour that you want to see.
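To make the "sum of squared residuals" concrete, here is a small sketch that fits the line with np.polyfit instead of seaborn. The seed and the helper names ols_line and sum_of_squares are assumptions for illustration, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
x = np.arange(0, 10, 0.2)
y_clean = 0.25 * x + rng.normal(0, 0.1, x.size)
y_out = y_clean.copy()
y_out[[12, 14, 18, 24]] -= 4  # same four outliers as above


def ols_line(x, y):
    """Hypothetical helper: least-squares slope and intercept of a straight line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return slope, intercept


def sum_of_squares(x, y, slope, intercept):
    """Sum of squared residuals (observed minus predicted) for a given line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))


slope_clean, intercept_clean = ols_line(x, y_clean)
slope_out, intercept_out = ols_line(x, y_out)

print(f"clean data:    slope={slope_clean:.3f}, intercept={intercept_clean:.3f}")
print(f"with outliers: slope={slope_out:.3f}, intercept={intercept_out:.3f}")

# The OLS line has a smaller sum of squares than the true line (slope 0.25, intercept 0),
# even though the true line describes the 46 regular points better.
print(sum_of_squares(x, y_out, slope_out, intercept_out))
print(sum_of_squares(x, y_out, 0.25, 0.0))
```

The four outliers sit below the line in the left half of the data, so the least-squares fit lowers the intercept to accommodate them.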

Robust regression methods, on the other hand, replace the plain sum of squares with a criterion that limits the influence of large residuals. One example is least trimmed squares, which minimizes the sum of squares over only the subset of points with the smallest residuals. Another is M-estimation, which is typically fitted iteratively: after each fit, points with large residuals are given smaller weights, so that a given outlier ends up not having a huge effect on your coefficients. The latter is what statsmodels.robust.robust_linear_model.RLM does, and it is what gets called when you use robust=True in seaborn. The result, on the same data as before:

sns.regplot(x=x, y=y, robust=True)

(Figure: scatterplot of the same data with the robust regression line.)

Notice that the line was not dragged down by your outliers. In many cases, this is not the behaviour that people want, but it depends on what you are doing...
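The iterative reweighting idea can be sketched in plain NumPy. This is not the statsmodels implementation, just a simplified illustration of iteratively reweighted least squares with Huber-style weights. The function name huber_irls, the tuning constant t=1.345 and the MAD-based scale estimate are my assumptions.

```python
import numpy as np


def huber_irls(x, y, t=1.345, n_iter=50, tol=1e-8):
    """Hypothetical sketch: fit y ~ a + b*x by iteratively reweighted least squares.

    Points whose scaled residual exceeds t get weight t / |scaled residual|;
    all other points keep weight 1.
    """
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the OLS solution
    weights = np.ones_like(y)
    for _ in range(n_iter):
        resid = y - X @ beta
        # robust scale estimate: normalised median absolute deviation
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = np.abs(resid) / max(scale, 1e-12)
        weights = t / np.maximum(u, t)  # equals 1 where u <= t, t/u otherwise
        sw = np.sqrt(weights)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, weights


rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
x = np.arange(0, 10, 0.2)
y = 0.25 * x + rng.normal(0, 0.1, x.size)
outlier_idx = [12, 14, 18, 24]
y[outlier_idx] -= 4

beta, weights = huber_irls(x, y)
print("intercept, slope:", beta)
print("weights of the outliers:", weights[outlier_idx])
```

The outliers end up with weights close to zero, which is why they barely move the fitted line.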

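Note: this is really computationally expensive (just for those 50 datapoints, it took about 5 seconds to run on my machine). Much of that cost comes from refitting the robust model for every bootstrap resample used to draw the confidence band, which is why the documentation suggests decreasing n_boot or setting ci to None.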
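A sketch of the two workarounds the documentation mentions, using the ci and n_boot parameters of regplot. The seed, the Agg backend and the value n_boot=100 are arbitrary choices of mine, and how much time you save will depend on your machine and library versions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)  # fixed seed: an assumption, only for reproducibility
x = np.arange(0, 10, 0.2)
y = 0.25 * x + rng.normal(0, 0.1, x.size)
y[[12, 14, 18, 24]] -= 4

# Option 1: keep the robust line but skip the bootstrapped confidence band
ax = sns.regplot(x=x, y=y, robust=True, ci=None)

# Option 2: keep the band, but with fewer resamples than the default
# ax = sns.regplot(x=x, y=y, robust=True, n_boot=100)
```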

Which correlation coefficient to use?

If you want to keep reporting your Kendall correlation coefficient, do not use the robust argument when visualizing your data. Doing so would be misleading, because Kendall's sensitivity to outliers is not comparable to that of the robust linear regression drawn in the plot (to illustrate how much these measures can vary, in my data above the Kendall correlation was 0.85 and the Spearman correlation was 0.93). sns.regplot() with robust=True calls statsmodels.robust.robust_linear_model.RLM, which uses the HuberT() criterion by default. Because of this, if you want to report something like a correlation coefficient, my intuition is that you'll have to use some measure based on the Huber loss (you'll probably find more info about that here). Or, you can read this paper, which seems to have some insight about robust correlation coefficient alternatives.

sacuL answered Jan 09 '23 18:01