I am doing A/B testing and I am running into Simpson's paradox in my results: the winning variant differs depending on whether I slice the data by day, by month, or over the total duration of the test.
Thanks for your great help.
Further reading: http://en.wikipedia.org/wiki/Simpson%27s_paradox
Simpson's paradox is a statistical phenomenon where an association between two variables in a population emerges, disappears, or reverses when the population is divided into subpopulations.
Simpson's paradox arises when there are hidden variables that split the data into multiple separate distributions. Such a hidden variable is aptly referred to as a lurking variable, and lurking variables can often be difficult to identify.
Some authors reserve the label Simpson's paradox for a reversal in the direction of the marginal and partial association between two categorical variables, while others apply it to reversals involving continuous as well as categorical variables.
Simpson's paradox is an extreme condition of confounding in which an apparent association between two variables is reversed when the data are analyzed within each stratum of a confounding variable.
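To make the definitions above concrete, here is a minimal sketch with made-up conversion counts, showing how a variant can win on every day yet lose in the aggregate once the per-day sample sizes are unbalanced between variants:

```python
# Hypothetical (conversions, trials) per day, per variant.
# A beats B on each individual day, but B beats A overall,
# because the daily sample sizes are skewed between variants.
data = {
    "Mon": {"A": (90, 100),   "B": (800, 1000)},  # A: 90%, B: 80%
    "Tue": {"A": (300, 1000), "B": (20, 100)},    # A: 30%, B: 20%
}

def rate(conversions, trials):
    return conversions / trials

totals = {"A": [0, 0], "B": [0, 0]}
for day, variants in data.items():
    for v, (conv, n) in variants.items():
        totals[v][0] += conv
        totals[v][1] += n
    print(f"{day}: A={rate(*variants['A']):.0%}  B={rate(*variants['B']):.0%}")

# Aggregated over both days, the direction reverses:
print(f"Total: A={rate(*totals['A']):.1%}  B={rate(*totals['B']):.1%}")
```

Here A wins 90% vs 80% on Monday and 30% vs 20% on Tuesday, yet B wins overall (about 74.5% vs 35.5%), because most of A's traffic arrived on the low-converting day.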
If A is clearly, significantly better in individual A/B tests, while B scores better in aggregate, then the main implication is that you can't aggregate those data sets that way. A is better.
If the testing had produced the same results every day, you wouldn't see this reversal, even with varying sample sizes per day. So I think it additionally implies that something has changed. It could be anything, though. Maybe what you tested each day changed (perhaps in some very subtle way, like server speed). Or maybe the people you're testing it on changed (perhaps demographically, perhaps just in terms of their mood). That doesn't mean your testing is bad or invalid. It just means you're measuring something that's moving, and that makes things tricky.
And I might be miscalculating or misunderstanding the situation, but I think it is also necessarily true that you haven't been testing A and B the same number of times. That is, if on Monday you tested A 50 times and B 50 times, and on Tuesday you tested A 600 times and B 600 times, and so on, and A outscored B each day, then I don't see how you could get an aggregate result where B beats A. If this is true of your test setup, it certainly seems like something you could fix to make your data easier to reason about.
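This claim can be checked with a short sketch (the daily sample sizes and rates below are hypothetical). When both variants get the same number of trials each day, both overall rates are weighted averages with the same weights, so a variant that wins every day must also win in aggregate:

```python
# If A and B share the same per-day sample sizes, the overall rate of
# each variant is a weighted average of its daily rates using identical
# weights. A winning every day then forces A to win in aggregate.
daily_n = [50, 600, 200]        # same number of trials for A and B each day
rates_a = [0.12, 0.30, 0.08]    # A's daily conversion rates (beats B daily)
rates_b = [0.10, 0.25, 0.05]    # B's daily conversion rates

def aggregate(rates, sizes):
    conversions = sum(r * n for r, n in zip(rates, sizes))
    return conversions / sum(sizes)

agg_a = aggregate(rates_a, daily_n)
agg_b = aggregate(rates_b, daily_n)
assert agg_a > agg_b  # no reversal is possible with equal daily sizes
```

With unequal per-day sizes between A and B, the two weighted averages use different weights, which is exactly what opens the door to the reversal.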
It's a little difficult to say without seeing the exact data & the dimensions you are testing, but generally speaking you want to make decisions based on the uncombined data. This article from Microsoft gives a pretty clear example of Simpson's paradox in software testing.
Can you provide a clean example of your combined and uncombined data and a brief summary of the test?