 

sample size for A/B fisher test significance

Given the results for a simple A / B test...

        A   B
clicked 8   60
ignored 192 1940

(i.e. a conversion rate of 4% for A and 3% for B)

... a Fisher test in R quite rightly says there's no significant difference:

> fisher.test(data.frame(A=c(8,192), B=c(60,1940)))
...
p-value = 0.3933
...

But what function is available in R to tell me how much I need to increase my sample size to get to a p-value of say 0.05?

I could just increase the A values (in their proportion) until I get there, but there must be a better way? Perhaps pwr.2p2n.test [1] is somehow usable?

[1] http://rss.acs.unt.edu/Rdoc/library/pwr/html/pwr.2p2n.test.html
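For reference, the brute-force approach described above can be sketched in a few lines of base R: scale both columns up by an integer multiplier, keeping the observed proportions, until fisher.test() drops below 0.05. (This is just a sketch of the workaround, not the answer to the question.)

```r
# p-value of the Fisher test when the whole table is scaled up k-fold,
# preserving the observed click/ignore proportions in each group
p_at_scale <- function(k) {
  tab <- rbind(clicked = c(8, 60) * k,
               ignored = c(192, 1940) * k)
  fisher.test(tab)$p.value
}

# Smallest integer multiplier that reaches significance
k <- 1
while (p_at_scale(k) >= 0.05) k <- k + 1
k
```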

asked Jun 03 '12 by mat kelcey



1 Answer

power.prop.test() should do this for you. To get the math to work, I converted your 'ignored' counts into total impressions by summing each column (200 for A, 2000 for B).

> power.prop.test(p1=8/200, p2=60/2000, power=0.8, sig.level=0.05)

     Two-sample comparison of proportions power calculation 

              n = 5300.739
             p1 = 0.04
             p2 = 0.03
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group

That gives about 5301 per group, so your total sample size needs to be roughly 10600. Subtracting the 2200 impressions that have already run, you have about 8400 "tests" to go.

In this case:

  • sig.level is the same as your p-value.
  • power is the likelihood of finding significant results that exist within your sample. This is somewhat arbitrary, 80% is a common choice. Note that choosing 80% means that 20% of the time you won't find significance when you should. Increasing the power means you'll need a larger sample size to reach your desired significance level.
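The power trade-off in the bullets above is easy to see directly: re-running power.prop.test() with a higher power requirement yields a larger required sample per group.

```r
# Required per-group sample size at 80% vs. 90% power,
# holding the proportions and significance level fixed
n80 <- power.prop.test(p1 = 0.04, p2 = 0.03,
                       power = 0.8, sig.level = 0.05)$n
n90 <- power.prop.test(p1 = 0.04, p2 = 0.03,
                       power = 0.9, sig.level = 0.05)$n
n80 < n90  # TRUE: more power demands a larger sample
```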

If you want to estimate how much longer it will take to reach significance, divide 8400 by your number of impressions per day. That can help determine whether it's worthwhile to continue the test.
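The arithmetic above can be sketched as follows; the 500 impressions/day figure is a hypothetical assumption you would replace with your own traffic numbers (exact rounding gives 8402 remaining rather than the rounded 8400).

```r
# Per-group sample size required, rounded up to whole impressions
n_per_group <- ceiling(power.prop.test(p1 = 8/200, p2 = 60/2000,
                                       power = 0.8, sig.level = 0.05)$n)

total_needed   <- 2 * n_per_group   # power.prop.test's n is per group
already_run    <- 200 + 2000        # impressions served so far
remaining      <- total_needed - already_run

per_day        <- 500               # hypothetical daily impression count
days_remaining <- ceiling(remaining / per_day)
```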

You can also use this function to determine required sample size before testing begins. There's a nice write-up describing this on the 37 Signals blog.

power.prop.test() ships with base R's stats package, so you won't need to install or load anything. Beyond that, I can't say how similar it is to pwr.2p2n.test().

answered Oct 06 '22 by Lenwood