
Justin Bozonier is the Product Optimization Specialist at GrubHub, formerly a Sr. Developer/Analyst at Cheezburger. He has engineered a large, scalable analytics system and worked on actuarial modeling software, and he currently leads split test design, implementation, and analysis.

Finding a Sample Size for Your Split Test

10.15.2013

Testing Metrics Like Conversion Rate

We’ve now covered how to test our hypothetical customer purchase amounts, which tended to follow what’s called a log-normal distribution. Another distribution we see fairly often is the binomial distribution. Specifically, we see it in conversion rates. Let’s walk through a similar example and find the sample size we need to test a 3% lift in conversion rate. NOTE: This means a relative lift, e.g., from an 8% conversion rate to 8.24%. We’ll also assume that we want to be able to detect this shift 95% of the time.

Conversion rates follow a binomial rather than a log-normal distribution. This is because each data point is either a success or a failure. Compare this to our log-normal data, where each value could be any number greater than zero. Simulating this is a bit different but just as straightforward.

For the sake of argument let’s say the conversion rate we want to test is typically around 8%. We can generate the distribution like so:

In[83]:

%%R
library(ggplot2)

number_of_samples <- 1000
control_conversion_rate <- 0.08

# Each visitor either converts (TRUE) or does not (FALSE)
successes <- factor(rbinom(number_of_samples, 1, control_conversion_rate) == 1)
control_distribution <- data.frame(success = successes)
p <- ggplot(control_distribution, aes(x=success)) + 
    geom_bar() +    # a bar chart, since success is a discrete factor
    xlab("Purchased?") +
    ylab("# of visitors") + 
    ggtitle("Proportion of visitors purchasing vs not") + 
    theme(axis.title.y = element_text(angle = 0))
print(p)

Binomial distribution

This looks a lot different from our purchase amount distribution. Here’s a refresher so you can compare:

In[84]:

%%R 
# Reload the purchase-amount data from the previous post for comparison
sales <- read.csv('~/Documents/Notebooks/raw_data/sales_data.csv', header=F, col.names=c('purchase_amount'))

p <- ggplot(sales, aes(x=purchase_amount)) + 
    geom_histogram(binwidth=.5) + 
    xlim(0, 150) + 
    xlab("Customer amount spent (in $)") +
    ylab("# of customers") + 
    ggtitle("Amounts spent by individual consumers") + 
    theme(axis.title.y = element_text(angle = 0))
print(p)

Lognormal distribution

These distributions are radically different. For modeling the customer purchases, I needed the mean and standard deviation. I can get the mean here, but the standard deviation isn’t a free parameter: when every data point is either a success or a failure, the spread is completely determined by the conversion rate itself. Because of this, we need a different statistical test to measure our results, known as a binomial test. Like I did for purchases, I’m going to start by simulating a test between two equivalent distributions, with no changes to either one yet. In this case, both distributions are binomial.

In[85]:

%%R
number_of_samples <- 7000
control_conversion_rate <- 0.08

# A/A test: simulate 2,001 split tests in which nothing has actually changed
simulation_results <- sapply(seq_len(2001), function(i){
    conversions <- sum(rbinom(number_of_samples, 1, control_conversion_rate))
    results <- binom.test(conversions, number_of_samples, control_conversion_rate, alternative='greater')
    results$p.value <= 0.05
})

percent_of_time_detected <- sum(simulation_results) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_detected*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 5.65%"

Again, we find a significant change only ~5% of the time, which matches the 5% false-positive rate implied by our 95%-confidence statistical test. Now let’s apply the effect size we want to test for and start searching for the sample size at which we detect it 95% of the time.

In[86]:

%%R
number_of_samples <- 11000
effect_to_measure <- 1.03

# Simulate 2,001 split tests in which the treatment truly has a 3% relative lift
simulation_results <- sapply(seq_len(2001), function(i){
    conversions <- sum(rbinom(number_of_samples, 1, control_conversion_rate * effect_to_measure))
    results <- binom.test(conversions, number_of_samples, control_conversion_rate, alternative='greater')
    results$p.value <= 0.05
})

percent_of_time_detected <- sum(simulation_results) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_detected*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 23.74%"

Wow. So, not even close. Let's guess and check until we get to 95%.

In[91]:

%%R
number_of_samples <- 145000
effect_to_measure <- 1.03

# Same simulation as above, with a much larger sample size per test
simulation_results <- sapply(seq_len(2001), function(i){
    conversions <- sum(rbinom(number_of_samples, 1, control_conversion_rate * effect_to_measure))
    results <- binom.test(conversions, number_of_samples, control_conversion_rate, alternative='greater')
    results$p.value <= 0.05
})

percent_of_time_detected <- sum(simulation_results) / length(simulation_results)
print(paste(c("Percentage of time effect detected: ", round(percent_of_time_detected*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 95.35%"

After a lot of guess-and-check, I finally found a sample size that gets us close to being able to reliably measure a 3% shift. That’s a lot, right? An important thing to remember, though, is that since this is conversion rate, we only need this many visitors, whether they order or not. Still, compared to our purchase amounts, this is a very different scenario. Or is it?
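As a sanity check on the guess-and-check, the textbook normal-approximation formula for a one-sided, one-sample proportion test gives a similar answer. Here’s a sketch, assuming a 5% significance level and 95% power:

%%R
# Sanity check (sketch): normal-approximation sample size for a
# one-sided, one-sample proportion test.
# n ~ ((z_alpha * sqrt(p0*(1-p0)) + z_beta * sqrt(p1*(1-p1))) / (p1 - p0))^2
p0 <- 0.08                # baseline conversion rate
p1 <- p0 * 1.03           # conversion rate after a 3% relative lift (0.0824)
z_alpha <- qnorm(0.95)    # one-sided significance level of 5%
z_beta  <- qnorm(0.95)    # target power of 95%
n <- ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))) / (p1 - p0))^2
print(ceiling(n))

That formula suggests roughly 140,000 samples, which agrees nicely with the 145,000 the simulation converged on.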

See, in order to get those 24,000 purchases so we can run our statistical test on purchase amounts, we would need to capture enough visitors to our web site as well. Since 92% of visitors don’t order (our conversion rate is 8%), we have to show our experiment to far more visitors than we might expect.

Given that 8% of our visitors convert to customers, if we need 24,000 customers to detect a 2% lift in purchase amounts, then we will need 300,000 visitors to our web site per variation in our experiment to ensure we get enough customers. We only need 145,000 visitors per variation to detect a 3% lift in our conversion rate. At least we would kill two birds with one stone.
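To make that traffic math concrete, here’s the one-line version of the calculation (the 24,000-customer figure comes from the earlier purchase-amount analysis):

%%R
customers_needed <- 24000    # purchases needed per variation (from the earlier post)
conversion_rate <- 0.08
visitors_needed <- customers_needed / conversion_rate
print(visitors_needed)       # 3e+05, i.e. 300,000 visitors per variation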

Normal, Log-Normal, Binomial … What’s the Point?

The point is that the shape of the distribution matters just as much to the required sample size as the effect size we’re looking for and the power with which we want to detect it. With a purely mathematical solution, we would need a different formula for each distribution we encountered. Here, simulation serves as a one-size-fits-all solution.
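To show what that one-size-fits-all solution can look like, here’s a minimal sketch of a reusable power simulator. The names estimate_power, generate_data, and run_test are my own illustration, not from the analysis above: you plug in any data generator and any significance test, and it estimates power by brute force.

%%R
# A hypothetical, distribution-agnostic power estimator (sketch).
# generate_data() returns one simulated experiment's data;
# run_test() returns a p-value for that data.
estimate_power <- function(generate_data, run_test, simulations = 2000, alpha = 0.05) {
    rejections <- replicate(simulations, run_test(generate_data()) <= alpha)
    mean(rejections)
}

# The binomial case from this post, re-expressed with the helper:
n <- 145000
power <- estimate_power(
    generate_data = function() rbinom(1, n, 0.08 * 1.03),   # total conversions in one test
    run_test = function(conversions)
        binom.test(conversions, n, 0.08, alternative = 'greater')$p.value
)
print(power)

Swapping in a log-normal generator and a t-test (or any other pairing) requires no changes to estimate_power itself.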

For further reading about these distributions, you can refer to Wikipedia:

  • Normal distribution: https://en.wikipedia.org/wiki/Normal_distribution
  • Log-normal distribution: https://en.wikipedia.org/wiki/Log-normal_distribution
  • Binomial distribution: https://en.wikipedia.org/wiki/Binomial_distribution

The Case For Brute Force

Without needing to delve too deeply into mathematical theory, we’ve used simulations to help us conduct a statistical power analysis and determine a pretty accurate approximation of the number of samples we need in each variation to identify a 2% shift in the mean of our purchase amounts! Then we were able to put a hard stop on the sample size of our test and instead played with changing the statistical power of the test as well as the effect size we’re measuring.

It’s an imperfect method, but we can always get more accuracy: all we need to do is tune our algorithm to run more simulated split tests. That takes up some extra computer time, but computer time is MUCH cheaper than human time.
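How much accuracy does more simulation buy? The estimated power is itself a proportion, so its Monte Carlo standard error is sqrt(p * (1 - p) / B) for B simulated tests, meaning precision improves with the square root of the number of runs. A quick check for the roughly 2,000 runs used above:

%%R
estimated_power <- 0.95
simulations <- 2001     # the loops above run the test 2,001 times
standard_error <- sqrt(estimated_power * (1 - estimated_power) / simulations)
print(standard_error)   # ~0.0049, i.e. about half a percentage point

Halving that error means quadrupling the number of simulated tests.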

This chapter from Ben Bolker’s Ecological Models and Data in R puts it best:

“Increases in computing power have extended power analyses to many new areas, and R’s capability to run many repeated stochastic simulations is a great help. Paradoxically, the mathematical difficulty of deriving power formulas is a great equalizer: since even research statisticians typically use simulations to estimate power, it’s now possible (by learning simulation, which is easier than learning advanced mathematical statistics) to work on an equal footing with even cutting-edge researchers.” http://ms.mcmaster.ca/~bolker/emdbook/chap5A.pdf

And here’s an article from the company that produces SAS (a major statistical software suite) on running stochastic simulations to conduct similar power analyses: http://blogs.sas.com/content/iml/2013/05/30/simulation-power/

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)
