Finding a Sample Size for Your Split Test
Why Sample Sizes?
When discussing an experiment design, the most contentious topic is, without a doubt, sample size. The most common question is, “Why is the sample size so big?” The good news is that the sample size doesn’t have to be so big IF you’re willing to compromise in other ways. In this post, I’m going to delve into how to evaluate a sample size for a log-normally distributed dataset. Then, I’ll focus on how changing a couple of parameters can dramatically reduce the sample sizes we need. Lastly, I’ll show how using a different statistical distribution (to a binomial distribution) can alter the analysis.
How Are We Going To Do This?
The easiest way to do this analysis (if you know how to program) is to:
- Model the variable we’ll be testing
- Simulate changing it
- Compare the changed distribution to the original many thousands of times
Important: I’m going about this backwards, in a way. I’m going to say what effect I want to be able to measure and how confident I want to be, and then back into the sample size by trying hundreds of thousands of tests. I then look for the minimum sample size that meets the constraints I set.
Here are more specific steps:
- Look at the real data. Get real data and see what kind of distribution it forms.
- Create a mathematical model of the data that you can use to simulate shifts.
- Run thousands of simulated statistical tests between a simulated control and a simulated variant where we have a known improvement. Count the number of times we’re able to detect the given shift.
- The number of times we detect the shift divided by the number of simulated split tests is the likelihood of detecting an effect of the given magnitude with the given sample size. This is also known as the statistical power.
- Run other simulations varying sample size and the change between the simulated control and the simulated variant.
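The steps above can be sketched in a few lines of R. Treat this as a minimal illustration rather than the exact procedure: the function name `simulate_power`, the choice of a Welch t-test on log-transformed amounts, the 0.05 significance threshold, the 80% power target, the 10% lift, and the candidate sample sizes are all assumptions made for the example. The log-mean of 3.21 and log-standard deviation of 0.6 are placeholder model parameters.

```r
# Sketch: estimate statistical power by simulation.
# Assumptions (for illustration only): Welch t-test on log(amount),
# a multiplicative lift on the variant, alpha = 0.05.
simulate_power <- function(n_per_group, lift, meanlog, sdlog,
                           n_sims = 1000, alpha = 0.05) {
  detections <- replicate(n_sims, {
    control <- rlnorm(n_per_group, meanlog = meanlog, sdlog = sdlog)
    # Multiplying every purchase by (1 + lift) shifts the log-mean by log(1 + lift)
    variant <- rlnorm(n_per_group, meanlog = meanlog + log(1 + lift), sdlog = sdlog)
    t.test(log(variant), log(control))$p.value < alpha
  })
  # Share of simulated split tests that detected the known shift
  mean(detections)
}

set.seed(42)
candidate_sizes <- c(250, 500, 1000, 2000)  # hypothetical sizes to try
power_by_size <- sapply(candidate_sizes, simulate_power,
                        lift = 0.10, meanlog = 3.21, sdlog = 0.6)
print(data.frame(n_per_group = candidate_sizes, power = power_by_size))

# Smallest candidate size that reaches the (assumed) 80% power target
min_size <- candidate_sizes[which(power_by_size >= 0.8)[1]]
print(min_size)
```

Rerunning the last few lines with a different `lift` or a finer grid of `candidate_sizes` is the "run other simulations" step from the list.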
Pretending To Be A Normal Log
Before we can analyze sample sizes for how much customers spend, we first need to figure out how to model their purchases. My first step is always to look at the data I want to model. Since most of my examples of log-normal data come from work, we’ll have to use a generated dataset and pretend it’s natural.
%%R
generated_sales_data <- round(rlnorm(10000, meanlog=3.21, sdlog=.6), 2)
write.table(generated_sales_data, "raw_data/sales_data.csv", col.names=FALSE, row.names=FALSE, sep=',', quote=FALSE)
Now that we have our “real” sales data, let’s forget we made it at all and work backwards to see how we can reverse engineer the parameters I used to make it up. First, let’s plot the data and see what kind of distribution we have.
%%R
library('ggplot2')
sales <- read.csv('~/Documents/Notebooks/raw_data/sales_data.csv', header=FALSE, col.names=c('purchase_amount'))
p <- ggplot(sales, aes(x=purchase_amount)) +
  geom_histogram(binwidth=.5) +
  xlim(0, 150) +
  xlab("Customer amount spent (in $)") +
  ylab("# of customers") +
  ggtitle("Amounts spent by individual consumers") +
  theme(axis.title.y = element_text(angle = 0))
print(p)
This looks a lot like a log-normal distribution. We can model a random distribution that looks like this by computing a couple of values from the above data: the mean of the log of each purchase amount and the standard deviation of the log of each purchase amount. Here’s how that comes together:
%%R
library('ggplot2')
# The CSV was written without a header row, so read it with header=FALSE;
# header=TRUE would silently consume the first purchase as the column name
df <- read.csv('~/Documents/Notebooks/raw_data/sales_data.csv', header=FALSE, col.names=c('amount_spent'))
purchase.count <- length(df$amount_spent)
# Parameters of the log-normal model: mean and standard deviation of log(amount)
purchase.log_mean <- mean(log(df$amount_spent))
purchase.log_stdev <- sd(log(df$amount_spent))
# paste() on a single vector keeps two elements, so each line prints a label/value pair
print(paste(c("Standard mean of amount spent:", round(mean(df$amount_spent),2)), sep=''))
print(paste(c("Standard deviation of amount spent:", round(sd(df$amount_spent),2)), sep=''))
print(paste(c("Log mean of amount spent:", purchase.log_mean), sep=''))
print(paste(c("Log standard deviation of amount spent:", purchase.log_stdev), sep=''))

[1] "Standard mean of amount spent:" "29.66"
[1] "Standard deviation of amount spent:" "19.48"
[1] "Log mean of amount spent:" "3.20924196153511"
[1] "Log standard deviation of amount spent:" "0.601137563076673"
Notice how different the log-mean and log-standard deviation are from their typical counterparts. When I first learned to do this, I always hoped I could just use the standard mean and standard deviation, but they don’t even come close. So much for being lazy! ;)
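The two sets of numbers are linked, though. For a log-normal distribution with log-mean μ and log-standard deviation σ, the ordinary mean is exp(μ + σ²/2) and the ordinary standard deviation is that mean times sqrt(exp(σ²) − 1). Here’s a small sketch that plugs in the log parameters printed above; the variable names are mine, and the results should land close to (not exactly on) the sample mean and standard deviation, since those come from a finite sample.

```r
# Log-scale parameters copied from the output above
log_mean <- 3.20924196153511
log_sd <- 0.601137563076673

# Standard log-normal identities relating log-scale and raw-scale moments
implied_mean <- exp(log_mean + log_sd^2 / 2)
implied_sd <- implied_mean * sqrt(exp(log_sd^2) - 1)

# exp(log_mean) alone gives the median (geometric mean), not the mean
implied_median <- exp(log_mean)

print(round(c(mean = implied_mean, sd = implied_sd, median = implied_median), 2))
```

This is also why `rlnorm` needs the log-scale values: handing it the raw mean and standard deviation would describe a very different distribution.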
Now that we have these two parameters, we should be able to create a pretty good model of our data.
%%R
# Create modeled data
simulated_purchases <- data.frame(amount_spent=rlnorm(purchase.count, meanlog=purchase.log_mean, sdlog=purchase.log_stdev))
# Graph it
p <- ggplot(simulated_purchases, aes(x=amount_spent)) +
  geom_histogram(binwidth=0.5) +
  xlim(0, 150) +
  xlab("price") +
  ylab("# of orders") +
  ggtitle("Simulated price frequency from one day") +
  theme(axis.title.y = element_text(angle = 0))
print(p)
Looking at both the real and the simulated distributions of how much customers spent, we can see they’re pretty similar. One little difference you may notice in the wild is that your simulated histogram won’t have some of the sporadic sharp spikes your real data has. Not a big deal, but I’m pointing it out in case you’re ever left wondering about it. So now what?
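If eyeballing two histograms feels too loose, one option is to compare a handful of quantiles side by side. The sketch below is self-contained, so it regenerates a stand-in for the “real” data instead of reading the CSV; the seed, the chosen quantiles, and the variable names are assumptions for illustration.

```r
set.seed(1)
# Stand-in for the "real" sales data (the post reads it from a CSV instead)
real <- round(rlnorm(10000, meanlog = 3.21, sdlog = 0.6), 2)

# Fit the log-normal model the same way as above
fit_meanlog <- mean(log(real))
fit_sdlog <- sd(log(real))
simulated <- rlnorm(length(real), meanlog = fit_meanlog, sdlog = fit_sdlog)

# Compare selected quantiles of the real and simulated amounts
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
comparison <- data.frame(
  quantile = probs,
  real = quantile(real, probs),
  simulated = quantile(simulated, probs)
)
comparison$pct_diff <- 100 * (comparison$simulated / comparison$real - 1)
print(round(comparison, 2))
```

Small percentage differences across the quantiles suggest the model is a reasonable stand-in; large gaps in the tails would be a hint that a log-normal isn’t the right shape.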
Deciding On The Effect Size To Test For
Something that can dramatically affect our sample size is the size of the effect we decide to test for. This is a parameter we’re free to adjust in order to mitigate risk. As we’ll see in the examples that follow, larger shifts in our data require dramatically fewer samples. Since most of the features we test have a low likelihood of dramatically improving results, though, we tend to stick with looking for a ~2% shift in our numbers.
This is a primary place where we can tune our tests. The hard part is knowing what size of effect to measure. We’d like to test for as small an effect as possible, right? Sure. At some point, though, we end up running the test on all of our customers, or the test takes forever to complete. You need to choose a number that will allow your team to complete most tests in a week (if you can) and that balances the risk of negatively affecting customers with the risk of mistakenly thinking a feature under test is ineffective.
Let me repeat this because it is vital: We have to balance the possibility of our test not being sensitive enough with the possibility of negatively affecting our customers. We have the ability to work with this but it requires an open dialog with concrete constraints between stakeholders.
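To get a rough feel for how strongly the effect size drives the sample size, here’s a back-of-the-envelope sketch using R’s built-in `power.t.test`. It is an approximation, not the simulation approach described earlier: it assumes a two-sample t-test on the log of the amounts, a log-standard deviation of 0.6, a 0.05 significance level, and 80% power. The list of lifts is hypothetical.

```r
# Approximate per-group sample size for several effect sizes.
# Assumptions: t-test on log(amount), sdlog = 0.6, alpha = 0.05, power = 0.80.
sdlog <- 0.6
lifts <- c(0.02, 0.05, 0.10, 0.20)

n_per_group <- sapply(lifts, function(lift) {
  # A multiplicative lift of (1 + lift) is a shift of log(1 + lift) on the log scale
  result <- power.t.test(delta = log(1 + lift), sd = sdlog,
                         sig.level = 0.05, power = 0.80)
  ceiling(result$n)
})

print(data.frame(lift = lifts, n_per_group = n_per_group))
```

Because the required sample size scales roughly with one over the square of the shift, moving from a 2% target to a 10% target cuts the number of customers needed by more than an order of magnitude. That is the trade-off stakeholders need to weigh.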