
Justin Bozonier is the Product Optimization Specialist at GrubHub, formerly a Sr. Developer/Analyst at Cheezburger. He has engineered a large, scalable analytics system and worked on actuarial modeling software, and he currently leads split test design, implementation, and analysis. The opinions expressed here are his own and not those of his employer.

Finding a Sample Size for Your Split Test

10.15.2013

Modeling A Split Test Looking For A 2% Effect

Let’s assume we’ve decided we need to test for an effect of at least 2%. This means we want to detect that the measured mean has increased by 2%.

If you recall, above we showed that we can model the variable we want to test (purchase amounts) as a log-normal distribution. Let’s try modeling a slight improvement to our variation and see what happens when we compare the two. Rather than rederive the constants we used before I’ll just use the values of mean and standard deviation we’ve already calculated. First, let’s try comparing two different random distributions that should be equivalent (in other words, no statistically significant difference in the means).
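As a quick aside (this is my sketch, not part of the original post): the constants purchase.log_mean and purchase.log_stdev referenced below come from fitting a log-normal distribution to the raw purchase data earlier in this series. Roughly, that boils down to something like the following, where purchase_amounts stands in for a hypothetical vector of observed purchase totals:

# Hypothetical derivation of the log-scale parameters used below.
# `purchase_amounts` is a placeholder for the raw purchase data.
purchase.log_mean <- mean(log(purchase_amounts))
purchase.log_stdev <- sd(log(purchase_amounts))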

In[72]:

%%R 
number_of_samples <- 1000
control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
equivalent_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)

results <- t.test(equivalent_variation, control, alternative='greater')
print(results)
Welch Two Sample t-test

data:  equivalent_variation and control
t = -0.2264, df = 1997.376, p-value = 0.5895
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-1.560934       Inf
sample estimates:
mean of x mean of y 
28.92833  29.11709  

So ... great! We have statistical test results, but what do they mean? Search through the output for the p-value. Let’s accept the conventional threshold that a p-value of less than .05 is statistically significant. In this case our p-value is greater than .05, so we conclude that there’s no evidence that our variation’s mean is greater than our control’s.

Great. That matches our expectations.

Let’s go a bit further now. Let’s run this test a couple thousand times and see how often we get a result that matches our expectations. Since there is no real difference between the two distributions, we should see a statistically significant difference only around 5% of the time. Running this simulation takes less than a second.

In[73]:

%%R
number_of_samples <- 1000

simulation_results <- mapply(function(x){
    # Draw a control group and an equivalent variation from the same distribution
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    equivalent_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    # Run the one-sided t-test and record whether it reports significance at the .05 level
    results <- t.test(equivalent_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 5.55%"

Pretty close to 5% of the time, the results show a statistically significant difference even though there is no real effect. Perfect. This validates that the t-test is working as expected for our needs so far, and it’s consistent with statistical testing theory and with how t.test works: at a .05 significance threshold, we expect roughly a 5% false positive rate.

We have a few concepts here that can seem VERY similar, so allow me to be a bit more detailed:

  • Statistical significance: testing at the 95% significance level means that when there is no real effect, we expect to mistakenly detect one only about 5% of the time.
  • Statistical power: the percentage of the time that an effect of a certain size will be detected (a quick analytical cross-check using R's built-in power.t.test follows this list). Refer here for more information: http://en.wikipedia.org/wiki/Statistical_power
  • Effect size: the magnitude of the change to be measured. Also referred to as sensitivity.
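As a rough analytical cross-check (my addition, and it assumes approximately normal data, which our purchase amounts only loosely satisfy on the raw scale), base R's power.t.test can solve for any one of sample size, effect size, or power. The purchase mean and standard deviation below are placeholder values, not the real ones from the earlier modeling:

# Hedged cross-check with base R's power.t.test (normality assumption).
# purchase_mean and purchase_sd are placeholder values for illustration.
purchase_mean <- 29   # roughly the mean purchase amount seen in the t-test output above
purchase_sd <- 25     # placeholder standard deviation of raw purchase amounts

power.t.test(delta = purchase_mean * 0.02,   # a 2% shift in the mean
             sd = purchase_sd,
             sig.level = 0.05,
             power = 0.95,
             type = "two.sample",
             alternative = "one.sided")      # solves for the required n per group

Rerunning the call with power = 0.90 shows how much the required sample size drops, which is exactly the trade-off discussed next.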

When you hear statisticians discussing 95% significance, what they’re really saying is that if they ran the experiment many times with no real effect present, they would expect to mistakenly detect an effect, due simply to random chance, about one time in twenty. The important consequence is that we never get to a point where we are 100% confident. We can approach it, but that basically equates to measuring the effect on our entire population of customers, and at that point we might as well just roll the feature out. There’s a give and take here, so it’s another place where we can make compromises.

If we say we want to measure for an effect size of 2% with 95% statistical power, that means we want to be able to detect an effect of at least 2% at least 95% of the time. Maybe we’re OK with missing a real positive effect twice as often (90% statistical power). That’s a perfectly valid decision to make as long as there’s an understanding of how it impacts our testing.

Next we need to see what percent of the time we detect an improvement if we shift the variation to have a small difference. Let’s do the same test as above but let’s shift our variation by 2%. Let’s call our shot and predict what should happen. I don’t know exactly what to expect, but I do know that since there is now a change, we should detect a change more often than 5% of the time.

In[74]:

%%R
number_of_samples <- 1000
effect_size <- 1.02

simulation_results <- mapply(function(x){
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
    results <- t.test(improved_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 17.79%"

Big change! Now we're seeing that these two distributions show a statistically significant difference ~18% of the time. This means that if we were okay with detecting a real improvement in our test features only ~18% of the time, we would only need 1,000 samples per variation. From the standpoint of limiting potential negative customer experience, that's pretty risk averse. From an experimentation perspective, however, it’s pretty terrible: our tests would hardly even be repeatable. When we first started testing we actually started here, and it was confusing to get different results on every test run.

In a real split test, when there’s a change, we ideally want to be able to measure a change at least 95% of the time. Again, that’s something we can explore shifting but let’s take it as a given right now. Let’s try increasing our sample size and see where it gets us.

In[75]:

%%R
number_of_samples <- 10000
effect_size <- 1.02

simulation_results <- mapply(function(x){
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
    results <- t.test(improved_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 68.72%"

I took a stab in the dark and tried 10x the samples. That got us MUCH closer to detecting the change 95% of the time but we’re not quite there yet. Notice how very non-mathematical this approach is? This is a chief draw of using simulation to perform complex analysis. Rather than having to find a formula or tool online and just trust it, we can brute-force the solution in a verifiable way.
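Since we're about to rerun essentially the same simulation with different sample sizes and effect sizes, here's a sketch (my own refactoring, not code from the original post) that wraps the repeated block in a helper. The name detection_rate is hypothetical, and it reuses the purchase.log_mean and purchase.log_stdev values calculated earlier:

# Hypothetical helper: fraction of simulated tests that detect the given effect.
detection_rate <- function(number_of_samples, effect_size = 1.0, runs = 2000) {
    detected <- sapply(seq_len(runs), function(i) {
        control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
        variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev) * effect_size
        t.test(variation, control, alternative='greater')$p.value <= .05
    })
    mean(detected)
}

# e.g. detection_rate(10000, 1.02) should land near the 68.72% figure above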

We’re at ~69% with 10,000 samples. Let’s triple the sample size and see if that helps us detect the 2% effect at least 95% of the time.

In[76]:

%%R
number_of_samples <- 30000
effect_size <- 1.02

simulation_results <- mapply(function(x){
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
    results <- t.test(improved_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))
[1] "Percentage of time effect detected: 97.55%"

Excellent! Roughly 98%. We overshot our goal a little, so now I fine-tune. Again, it's an imperfect process.

Later on, I explain exactly how a statistical test of conversion rates differs from one of purchase amounts.

Let’s continue our process of guess and check. Roughly 98% is more statistical power than we need for this test. Let’s see if we can get our sample size (and the risk!) down a little and still be at ~95%.
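If you'd rather not eyeball the fine-tuning, a simple sweep over candidate sample sizes between the 10,000 and 30,000 bounds we've already tried can narrow things down. This is a sketch using the hypothetical detection_rate helper from earlier, not how the original analysis was done:

# Sweep candidate sample sizes and report the first that reaches ~95% detection.
candidates <- seq(20000, 30000, by = 2000)
rates <- sapply(candidates, function(n) detection_rate(n, effect_size = 1.02))
print(data.frame(sample_size = candidates, detection_rate = rates))
print(candidates[which(rates >= 0.95)[1]])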

In[78]:

%%R
number_of_samples <- 24000
effect_size <- 1.02

simulation_results <- mapply(function(x){
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
    results <- t.test(improved_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 95.55%"

Changing our sample size to 24,000 bounces us around 95% pretty well. Voila! We’ve discovered the answer to our question:

In order to measure at least a 2% effect on customer purchase amount 95% of the time, we need a sample size around 24,000 purchases per variation.

Now, obviously, if I had to go through this process for every test we ran, it would be a LOT of work. Instead, I've made HUGE tables of precomputed values for common scenarios, and I filter those spreadsheets in Excel based upon the constraints of the test at hand.
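Here's a rough sketch (my own, not the author's actual spreadsheet tooling) of how such a table could be precomputed with the hypothetical detection_rate helper from earlier and then exported for filtering in Excel:

# Precompute detection rates over a grid of sample sizes and effect sizes.
# Note: this runs many simulations and can take a while for large sample sizes.
grid <- expand.grid(sample_size = c(1000, 5000, 10000, 25000, 50000),
                    effect_size = c(1.01, 1.02, 1.03, 1.05))
grid$detection_rate <- mapply(detection_rate,
                              number_of_samples = grid$sample_size,
                              effect_size = grid$effect_size)
write.csv(grid, "power_table.csv", row.names = FALSE)  # filter this in Excel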

Changing Constraints for a Smaller Sample Size

Ah, here is where it gets really interesting! What if we decide we don’t want to affect any more than 7,000 customers (or any arbitrary number)? Can we still use this technique? Absolutely. Let’s walk through this using the same purchase amounts we’ve already modeled and see how things change. Let’s state our question more formally:

In order to limit our test to only 7,000 customers per variation, what effect size can we reliably detect in purchase amounts, and with what statistical power? Let’s start with a wild guess and see what power we get when measuring a 2% shift in customer purchase amounts:

In[79]:

%%R
number_of_samples <- 7000
effect_size <- 1.02

simulation_results <- mapply(function(x){
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
    results <- t.test(improved_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 54.22%"

We're able to measure a true 2% shift ~54-57% of the time! Kind of terrible. So let’s try changing things. What if we said we were OK with detecting changes of 3% or more?

In[80]:

%%R
number_of_samples <- 7000
effect_size <- 1.03

simulation_results <- mapply(function(x){
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
    results <- t.test(improved_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 84.31%"

OK. Now we've found that we can correctly detect a 3% shift in the mean of purchase amounts ~84-87% of the time while only affecting 7,000 customers per variation. Let’s get the statistical power up to at least 90%. That means we need to increase the size of the shift we can measure. Let’s try 3.4%.

In[81]:

%%R
number_of_samples <- 7000
effect_size <- 1.034

simulation_results <- mapply(function(x){
    control <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)
    improved_variation <- rlnorm(number_of_samples, mean=purchase.log_mean, sd=purchase.log_stdev)*effect_size
    results <- t.test(improved_variation, control, alternative='greater')
    .05 >= results[3]$p.value
}, seq(0,2000, by=1))

percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 90.05%"

Yeah, I cheated above. I had to try 3.5%, 3.1%, 3.2%, and 3.3% first. Finally I tried 3.4%, and that got me closest. This is why generating a table has been so helpful: it lets me just look up the possibilities for a given constraint. (A sketch of automating that kind of sweep follows, and then we'll get back to the problem at hand.)
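As an aside, here's a rough sketch, using the hypothetical detection_rate helper from earlier, of how that manual sweep over effect sizes could be scripted rather than typed out by hand; treat it as illustrative, not as the original analysis:

# Sweep effect sizes at a fixed 7,000 samples per variation and look for
# the smallest shift detected at least 90% of the time.
effect_sizes <- seq(1.030, 1.040, by = 0.002)
rates <- sapply(effect_sizes, function(e) detection_rate(7000, effect_size = e))
print(data.frame(effect_size = effect_sizes, detection_rate = rates))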

Here’s the formal answer to our question: in order to affect only 7,000 customers per variation, we should test for a 3.4% shift in the mean of our purchase amounts, which we can expect to detect with ~90% statistical power.

Published at DZone with permission of Justin Bozonier, author and DZone MVB. (source)
