
What does Simpson's paradox imply in AB testing?

I am running an A/B test and I am seeing Simpson's paradox in my results (the results at the day, month, and total-test-duration levels disagree).

  1. Does it mean that my A/B test is not correct or representative? (Did some external factor affect the test?)
  2. If it is a sign of a problem, what directions should I follow?

Thanks for your great help.

Further reading: http://en.wikipedia.org/wiki/Simpson%27s_paradox

asked Jan 29 '10 by Toto




2 Answers

If A is clearly, significantly better in individual A/B tests, while B scores better in aggregate, then the main implication is that you can't aggregate those data sets that way. A is better.

If the test produced the same results every day, you wouldn't see this reversal, even with varying sample sizes per day. So I think it additionally implies that something has changed. It could be anything, though. Maybe what you tested each day changed (perhaps in some very subtle way, like server speed). Or maybe the people you're testing it on changed (perhaps demographically, perhaps just in terms of their mood). That doesn't mean your testing is bad or invalid. It just means you're measuring something that's moving, and that makes things tricky.

And I might be miscalculating or misunderstanding the situation, but I think it also necessarily means that you haven't been testing A and B in the same proportions each day. That is, if on Monday you tested A 50 times and B 50 times, on Tuesday you tested A 600 times and B 600 times, and so on, and A outscored B each day, then I don't see how you could get an aggregate result where B beats A. If unequal allocation is true of your test setup, it certainly seems like something you could fix to make your data easier to reason about.
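To make the unequal-allocation point concrete, here is a small sketch with hypothetical counts (borrowed from the classic kidney-stone illustration of Simpson's paradox, not from the asker's data): A wins on each day, but because A gets most of its traffic on the weak day, B wins the pooled totals.

```python
# Hypothetical daily A/B counts as (conversions, visitors).
# A is allocated mostly to the low-converting day, B to the high-converting one.
daily = {
    "day1": {"A": (81, 87),   "B": (234, 270)},   # high-converting traffic
    "day2": {"A": (192, 263), "B": (55, 80)},     # low-converting traffic
}

def rate(conversions, visitors):
    return conversions / visitors

# Per-day comparison: A beats B on both days.
for day, arms in daily.items():
    ra, rb = rate(*arms["A"]), rate(*arms["B"])
    print(day, f"A={ra:.1%} B={rb:.1%} winner={'A' if ra > rb else 'B'}")

# Naive pooled comparison: B beats A, because the daily allocations differ.
pooled = {arm: (sum(daily[d][arm][0] for d in daily),
                sum(daily[d][arm][1] for d in daily))
          for arm in ("A", "B")}
ra, rb = rate(*pooled["A"]), rate(*pooled["B"])
print(f"total A={ra:.1%} B={rb:.1%} winner={'A' if ra > rb else 'B'}")
```

With equal A/B traffic on every day, this reversal cannot happen: the pooled rate is then the same traffic-weighted average of the daily rates for both arms.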

answered Oct 11 '22 by Jason Orendorff


It's a little difficult to say without seeing the exact data and the dimensions you are testing, but generally speaking you want to make decisions based on the uncombined data. This article from Microsoft gives a pretty clear example of Simpson's paradox in software testing.

Can you provide a clean example of your combined and uncombined data and a brief summary of the test?
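As a sketch of what "decide on the uncombined data" can look like in practice (the counts are hypothetical and the traffic-weighted averaging is just one reasonable choice, not the only valid estimator): compare the arms within each day, then average the within-day differences weighted by each day's traffic, instead of pooling the raw counts.

```python
# Hypothetical stratified summary: (conversions, visitors) per arm per day.
data = {
    "day1": {"A": (81, 87),   "B": (234, 270)},
    "day2": {"A": (192, 263), "B": (55, 80)},
}

total_visitors = sum(n for arms in data.values() for _, n in arms.values())

# Traffic-weighted average of the within-day differences (A rate - B rate).
weighted_diff = 0.0
for day, arms in data.items():
    ca, na = arms["A"]
    cb, nb = arms["B"]
    weight = (na + nb) / total_visitors   # this day's share of all traffic
    weighted_diff += weight * (ca / na - cb / nb)

print(f"stratified estimate of A minus B: {weighted_diff:+.1%}")
```

On these numbers the stratified estimate comes out positive (A better), even though naively pooling the counts would favor B.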

answered Oct 11 '22 by Chris Clark