## Testing Metrics Like Conversion Rate

We’ve now covered how to test our hypothetical customer purchase amounts, which tended to follow what’s called a log-normal distribution. Another distribution we see fairly often is the binomial distribution, most commonly in conversion rates. Let’s walk through a similar example and find the sample size we need to test a 3% lift in conversion rate. **NOTE:** This is a relative lift: a conversion rate going from, say, 8% to 8.24%, not from 8% to 11%. We’ll also assume that we want to be able to detect this shift 95% of the time.
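To make the relative-versus-absolute distinction concrete, here is the arithmetic as a tiny R sketch. The variable names `treatment_conversion_rate` and `absolute_lift` are mine, chosen for illustration.

```r
control_conversion_rate <- 0.08
effect_to_measure <- 1.03  # a 3% relative lift

# The rate we would see in the improved variation.
treatment_conversion_rate <- control_conversion_rate * effect_to_measure

# The absolute change, in percentage points, that the test has to pick up.
absolute_lift <- treatment_conversion_rate - control_conversion_rate

print(treatment_conversion_rate * 100)
print(absolute_lift * 100)
```

The absolute change is only about a quarter of a percentage point, which is why the sample sizes later on get so large.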

Conversion rates follow a binomial rather than a log-normal distribution because each data point is either a success or a failure. Compare this to our log-normal data, where each point could be any number greater than zero. Simulating this is a bit different but just as straightforward.

For the sake of argument let’s say the conversion rate we want to test is typically around 8%. We can generate the distribution like so:

In[83]:

    %%R
    number_of_samples <- 1000
    control_conversion_rate <- 0.08
    # Each visitor is a single success/failure draw at the control rate.
    successes <- factor(rbinom(number_of_samples, 1, control_conversion_rate) == T)
    control_distribution <- data.frame(success = successes)
    # geom_bar() counts each category; recent ggplot2 versions reject
    # geom_histogram() on a discrete variable.
    p <- ggplot(control_distribution, aes(x=success)) +
      geom_bar() +
      xlab("Purchased?") +
      ylab("# of visitors") +
      ggtitle("Proportion of visitors purchasing vs not") +
      theme(axis.title.y = element_text(angle = 0))
    print(p)

This looks a lot different from our purchase amount distribution. Here’s a refresher so you can compare:

In[84]:

    %%R
    sales <- read.csv('~/Documents/Notebooks/raw_data/sales_data.csv', header=F, col.names=c('purchase_amount'))
    p <- ggplot(sales, aes(x=purchase_amount)) +
      geom_histogram(binwidth=.5) +
      xlim(0, 150) +
      xlab("Customer amount spent (in $)") +
      ylab("# of customers") +
      ggtitle("Amounts spent by individual consumers") +
      theme(axis.title.y = element_text(angle = 0))
    print(p)

These distributions are radically different. For modeling the customer purchases, I needed the mean and standard deviation. Here I can get the mean (it’s just the conversion rate), but there’s no separate standard deviation to estimate: with success-or-failure data, the spread is fully determined by the conversion rate. Because of this, we need a different statistical test to measure our results: the binomial test. Like I did for purchases, I’m going to start by simulating a test between two equivalent distributions, with no changes to either one yet. In this case, both are binomial distributions.
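For reference, a single binomial test looks like the sketch below. The simulations that follow simply repeat this call a couple of thousand times. The visitor and purchase counts are made-up numbers for illustration, not data from this article.

```r
# Hypothetical observation: 600 purchases out of 7,000 visitors (made-up numbers).
visitors <- 7000
purchases <- 600
control_conversion_rate <- 0.08

# One-sided exact binomial test: is the true rate greater than 8%?
result <- binom.test(purchases, visitors, p = control_conversion_rate,
                     alternative = "greater")

print(result$estimate)  # observed conversion rate
print(result$p.value)   # chance of a count at least this high if the rate were still 8%

# We call the effect "detected" when the p-value is at or below 0.05.
significant <- result$p.value <= 0.05
print(significant)
```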

In[85]:

    %%R
    number_of_samples <- 7000
    control_conversion_rate <- .08
    # Every simulated test draws visitors at the control rate, so any
    # "detected" effect here is a false positive.
    simulation_results <- mapply(function(x){
      control <- factor(rbinom(number_of_samples, 1, control_conversion_rate) == T)
      results <- binom.test(length(control[control == T]), number_of_samples, control_conversion_rate, alternative='greater')
      .05 >= results$p.value
    }, seq(0, 2000, by=1))
    percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
    print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 5.65%"

Again, we find a significant change only ~5% of the time, which is the false-positive rate we expect from a statistical test run at 95% confidence. Now let’s apply the effect size we want to test for and start looking for the sample size at which our split test detects it 95% of the time.

In[86]:

    %%R
    number_of_samples <- 11000
    effect_to_measure <- 1.03
    simulation_results <- mapply(function(x){
      # Despite the name, these draws use the lifted (control * 1.03) rate.
      control <- factor(rbinom(number_of_samples, 1, control_conversion_rate*effect_to_measure) == T)
      results <- binom.test(length(control[control == T]), number_of_samples, control_conversion_rate, alternative='greater')
      .05 >= results$p.value
    }, seq(0, 2000, by=1))
    percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
    print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 23.74%"

Wow. So, not even close. Let's guess and check until we get to 95%.

In[91]:

    %%R
    number_of_samples <- 145000
    effect_to_measure <- 1.03
    simulation_results <- mapply(function(x){
      # Despite the name, these draws use the lifted (control * 1.03) rate.
      control <- factor(rbinom(number_of_samples, 1, control_conversion_rate*effect_to_measure) == T)
      results <- binom.test(length(control[control == T]), number_of_samples, control_conversion_rate, alternative='greater')
      .05 >= results$p.value
    }, seq(0, 2000, by=1))
    percent_of_time_equivalent <- length(simulation_results[simulation_results==TRUE]) / length(simulation_results)
    print(paste(c("Percentage of time effect detected: ", round(percent_of_time_equivalent*100, digits=2), "%"), collapse=''))

[1] "Percentage of time effect detected: 95.35%"
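The guess-and-check loop can also be wrapped in a function and run over a handful of candidate sample sizes. This is only a sketch: `estimate_power` and `candidate_sizes` are names I made up, and for speed it draws the number of conversions per simulated test directly with `rbinom` instead of generating each visitor individually, which is statistically equivalent. It also uses fewer simulations per size than the cells above, so the estimates are noisier.

```r
# Hypothetical helper: estimate how often a one-sided binomial test
# detects a relative lift at a given sample size.
estimate_power <- function(number_of_samples,
                           control_conversion_rate = 0.08,
                           effect_to_measure = 1.03,
                           simulations = 2000,
                           alpha = 0.05) {
  # One draw per simulated test: the count of conversions at the lifted rate.
  conversions <- rbinom(simulations, number_of_samples,
                        control_conversion_rate * effect_to_measure)
  detected <- sapply(conversions, function(k) {
    binom.test(k, number_of_samples, control_conversion_rate,
               alternative = "greater")$p.value <= alpha
  })
  # Fraction of simulated tests in which the lift was detected.
  mean(detected)
}

set.seed(42)  # only so the sketch is repeatable
candidate_sizes <- c(11000, 50000, 100000, 145000)
power_by_size <- sapply(candidate_sizes, estimate_power, simulations = 500)
print(data.frame(number_of_samples = candidate_sizes, power = power_by_size))
```

Scanning a coarse grid first and then narrowing in on the range where the estimate crosses 95% saves a lot of manual re-running.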

After a lot of guess-and-check, I finally found a sample size that lets us reliably measure a 3% shift: about 145,000 visitors. That’s a lot, right? An important thing to remember, though, is that since this is conversion rate, this is the number of visitors we need, whether they order or not. Still, compared to our purchase amounts, this is a very different scenario. Or is it?

See, in order to get those 24,000 purchases so we can run our statistical test, we need to bring enough visitors to our website as well. Since 92% of them don’t order (our conversion rate is 8%), we have to show our experiment to a lot more visitors than we might expect.

Given that 8% of our visitors convert to customers, if we need 24,000 customers to measure a 2% lift in purchase amount, then we need about 300,000 visitors to our website in our experiment per variation to ensure we get enough customers. We only need 145,000 visitors in our experiment to measure a 3% lift in our conversion rate. At least we would kill two birds with one stone.
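The visitor arithmetic is simple enough to check in a couple of lines. The variable names here are illustrative; the numbers are the ones quoted above.

```r
conversion_rate <- 0.08
customers_needed <- 24000               # per variation, for the purchase-amount test
visitors_for_conversion_test <- 145000  # per variation, from the simulation

# Only 8% of visitors become customers, so divide by the conversion rate.
visitors_for_purchase_test <- customers_needed / conversion_rate

print(format(visitors_for_purchase_test, big.mark = ","))
# The purchase-amount test is the binding constraint on traffic.
print(visitors_for_purchase_test > visitors_for_conversion_test)
```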

## Normal, Log-Normal, Binomial … What’s the Point?

The point is that the shape of the distribution is just as important to sample size as the effect size we’re looking for and how reliably we want to detect it. With a purely mathematical solution, we would need a different formula for each distribution we encountered. Here, we use simulation as a one-size-fits-all solution.
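As an illustration of what one such formula looks like, here is a sketch of the textbook normal-approximation sample size for a one-sided, one-sample test of a proportion. This is not something used elsewhere in this article, and `approximate_sample_size` is a name I made up. It only applies to the binomial case; the purchase-amount test would need a different formula entirely.

```r
# Normal-approximation sample size for a one-sided, one-sample proportion test.
# p0 is the baseline rate, p1 is the rate we want to be able to detect.
approximate_sample_size <- function(p0, p1, alpha = 0.05, power = 0.95) {
  z_alpha <- qnorm(1 - alpha)  # critical value for the significance level
  z_power <- qnorm(power)      # critical value for the desired power
  numerator <- z_alpha * sqrt(p0 * (1 - p0)) + z_power * sqrt(p1 * (1 - p1))
  (numerator / (p1 - p0))^2
}

n <- approximate_sample_size(0.08, 0.08 * 1.03)
print(ceiling(n))
```

For the 8% baseline and 3% relative lift it comes out at roughly 140,000, in the same ballpark as the 145,000 found by simulation. The two won't match exactly, because the formula is an approximation and the simulation carries its own noise.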

For further reading about these distributions, you can refer to Wikipedia:

- Normal Distributions: http://en.wikipedia.org/wiki/Normal_distribution
- Log-Normal Distributions: http://en.wikipedia.org/wiki/Log-normal
- Binomial Distributions: http://en.wikipedia.org/wiki/Binomial_distribution
- Exponential Distributions: http://en.wikipedia.org/wiki/Exponential_distribution

## The Case For Brute Force

Without needing to delve too deeply into mathematical theory, we’ve used simulations to help us conduct a statistical power analysis and determine a pretty accurate approximation of the number of samples we need in each variation to identify a 2% shift in the mean of our purchase amounts! Then we were able to put a hard stop on the sample size of our test and instead played with changing the statistical power of the test as well as the effect size we’re measuring.

It’s an imperfect method, but we can get more accuracy. All we need to do is tune our algorithm to run more simulated split tests. That takes up some extra computer time, but that’s MUCH cheaper than human time.
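To get a feel for how much accuracy extra simulations buy, here is a sketch of the standard error of a simulated power estimate. It treats each simulated split test as an independent detected/not-detected draw, which is how the simulations above work. `mc_standard_error` is a hypothetical helper name.

```r
# Standard error of a proportion estimated from independent simulations.
mc_standard_error <- function(power_estimate, simulations) {
  sqrt(power_estimate * (1 - power_estimate) / simulations)
}

# Using a power estimate of about 95%, as found above.
simulation_counts <- c(500, 2001, 10000, 50000)
standard_errors <- mc_standard_error(0.95, simulation_counts)
print(data.frame(simulations = simulation_counts,
                 standard_error = round(standard_errors, 4)))
```

The error shrinks with the square root of the number of simulations, so quadrupling the simulated tests roughly halves the noise in the estimate.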

This paper puts it best:

“Increases in computing power have extended power analyses to many new areas, and R’s capability to run many repeated stochastic simulations is a great help. Paradoxically, the mathematical difficulty of deriving power formulas is a great equalizer: since even research statisticians typically use simulations to estimate power, it’s now possible (by learning simulation, which is easier than learning advanced mathematical statistics) to work on an equal footing with even cutting-edge researchers.” http://ms.mcmaster.ca/~bolker/emdbook/chap5A.pdf

And here’s an article from the company that produces SAS (a major statistical software package), which also covers running stochastic simulations to conduct similar power analyses: http://blogs.sas.com/content/iml/2013/05/30/simulation-power/

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)
