How to design conclusive A/B Testing experiments?
As a Data Engineer, when I am solving a problem, I often ask myself: what if there were no data for this problem? What if I had to make a design change with no clue how the market or users would react to it? Is there a more reliable method for decision-making?
This post will walk you through five critical steps to design a robust and conclusive experiment for A/B testing. Following are the major points which we’ll be covering:
- Introduction to A/B testing.
- Choosing the right business metrics.
- Statistical review.
- Designing the experiment.
- Analyzing the results.

There are many A/B testing tools available in the market. In this post, I won't be focusing on any particular tool, but on how A/B testing itself works. I'll discuss a statistical approach to test whether a new landing page for an e-learning platform would help increase user engagement on the platform.
1. Introduction to A/B testing
First of all, what exactly is A/B testing?
A/B testing is an online experimentation methodology used to test a new product or feature by exposing two different versions of it to a split user base. The first version is the existing feature, known as the control group, and the new version is known as the experiment (or variation) group.
Take two sets of users: one set gets the control group (the existing feature) and the other set gets the experiment group (the new version you're planning to roll out). Based on the data you gather, you decide which version of the product is better.
Hence, A/B testing has two variants: the control (A) and the experiment (B).
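To make the split concrete, here's a minimal Python sketch of one common way to divide traffic between the two variants; the function name, the salt, and the 50/50 split are illustrative assumptions, not something from any particular tool.

```python
import hashlib

def assign_group(user_id: str, salt: str = "landing_page_test") -> str:
    """Deterministically assign a user to the control or experiment group.

    Hashing the user id with an experiment-specific salt keeps the split
    stable, so the same user always sees the same variation.
    """
    digest = hashlib.md5(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100                      # map the hash to a bucket 0-99
    return "control" if bucket < 50 else "experiment"   # 50/50 split

# Route a few hypothetical visitors
for uid in ["user_001", "user_002", "user_003"]:
    print(uid, "->", assign_group(uid))
```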
What can we achieve using A/B testing?

We can test a wide variety of things with A/B testing, ranging from a new feature or an online portal/dashboard to additions to the UI/UX or a different look for your application. You can also test changes the users might not even notice; for example, Amazon determined that every 100ms increase in page load time decreased sales by 1%.
All the big players (Google, Amazon, Microsoft, Netflix, PayPal, and so on) use A/B testing. To quote a few examples:
- Amazon initially decided to launch its first personalized product recommendations based on an A/B test that showed a huge revenue increase from adding the feature.
- Google tested 41 different shades of blue.
- LinkedIn tested whether to use the top slot on a user's stream for top news articles or for an encouragement to add more contacts. (See the first paragraph of the "A/B testing with view based JSON" section.)

Given how extensively the methodology is used in the industry and all that we can achieve with it, we also need to know:
What’s it NOT for?
So, if you're trying to use A/B tests to gauge the reaction to a totally new product, gadget, or user experience, your chances of obtaining reliable data are very low. Here are the questions you should ask yourself to find out whether A/B testing is appropriate:
- Can the experiment be run in a short time window, say two business cycles?
- Can I control the environment in which the experiment is being performed?
- Would I be able to define measurable metrics for the experiment?
- Would I be able to gather enough data points to base my decision on?

Example

Moving on to the example case study that we will be using from here on: there is an imaginary e-learning platform called Edufin.com which focuses specifically on finance courses. Edufin follows this customer funnel:
- Landing page visits (the largest number of user interactions, at the top of the funnel)
- Looking through different course offerings
- Creating an account
- Enrolling in a course
- Completing a course

Problem statement: Edufin is a little skeptical about the conversion rate from their existing landing page. They want to run an experiment with a new landing page, record some data points, and find out how the new page performs with the audience.
Hence, our initial hypothesis is: rolling out a new landing page will enhance the conversion rate of the platform. Conversion rate here is the rate of student enrollment in courses.

We'll work through the remaining four steps on this example and try to design a robust experiment.
2. Choosing the right Business Metric
For any e-learning platform, the end goal is to have a large number of students completing courses. Can this be the right metric to evaluate the success or failure of our experiment? The answer is no, because completing a course may take weeks or months, and we need to run the experiment in a short time window.
An alternative is to record the number of clicks on the "View course" button on the landing page; that button takes us to the second stage of the customer funnel. Here, we can also comment on the environment in which the experiment is being run. However, there can be a case where we have more clicks in the experiment group but a greater conversion rate in the control group.
In the scenario above, blue dots represent the number of users who visited the landing page, while yellow dots denote clicks on the page. The new page receives more visitors, but the click-through rate (number of clicks / number of visitors) is higher in the control group. So, is this a good metric? Hold on, are these clicks unique? Let's say two users visited the page: one didn't click, and the other clicked 4 times out of frustration, maybe because the page loaded slowly. Here the CTR is 4/2 = 2, which is still misleading, as it doesn't tell us about the impact the page is going to create. Improving upon this, we can use a probability instead: the number of unique visitors who clicked out of the total number of unique visitors. This gives us 1/2 = 0.5 for the case we discussed.
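To make the distinction concrete, here's a tiny Python sketch using the made-up two-visitor click log from above; it computes both the click-through rate and the click-through probability.

```python
# Hypothetical click log for the two-visitor example above:
# one visitor never clicked, the other clicked 4 times.
clicks = {"visitor_1": 0, "visitor_2": 4}

total_visitors = len(clicks)                                  # 2 unique visitors
total_clicks = sum(clicks.values())                           # 4 clicks in total
unique_clickers = sum(1 for c in clicks.values() if c > 0)    # 1 visitor clicked

click_through_rate = total_clicks / total_visitors            # 4 / 2 = 2.0
click_through_probability = unique_clickers / total_visitors  # 1 / 2 = 0.5

print(f"CTR: {click_through_rate}, CTP: {click_through_probability}")
```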
Hence, click-through probability (CTP) is the metric which best suits the experiment.

As a general rule of thumb, you use a rate when you want to test the usability of a feature or product, and a probability when you want to find out the impact it can create.

Thus, the metric you decide on should:

- fit the timespan,
- provide statistically reliable data points, and
- help us achieve the end goal.
3. Statistical Review
A/B testing is a type of statistical hypothesis testing. In our case, the prediction is that landing page B will perform better than landing page A; data from both pages is then observed and compared to determine whether B is a statistically significant improvement over A.
Not to forget, the agenda is not merely to see which page performs better in our sample, but to find out how the landing page will perform with the target audience.

Statistical analysis allows us to use information we know to predict outcomes we don't know with a reasonable level of accuracy.
To start with any dataset, you should ask about the structure of that data. This tells us about the variability in the data. There are several types of distributions in statistics which can help us set some guidelines about how variable our data is.
We know that we have discrete data (not continuous data) here, with two outcomes: success (a click) or failure (no click). So we have a pretty solid case for using the binomial distribution. Choosing the right distribution is half the battle won. We then use the sample standard error of the binomial to estimate how variable we expect our overall click-through probability (CTP) to be. When we talk about variability at a 95% confidence interval (the industry standard), what we actually mean is that, if we theoretically repeated the experiment over and over again, we would expect the interval we construct around our sample mean to cover the true value in the population 95% of the time.
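As a rough illustration of that variability, here's a short sketch with made-up counts; it computes the binomial standard error of the sample CTP and the 95% confidence interval around it.

```python
import math

# Hypothetical sample: 10,000 unique visitors, 300 of whom clicked.
visitors = 10_000
clickers = 300

p_hat = clickers / visitors                         # sample CTP estimate
se = math.sqrt(p_hat * (1 - p_hat) / visitors)      # binomial standard error
z = 1.96                                            # z-score for a 95% confidence interval

lower, upper = p_hat - z * se, p_hat + z * se
print(f"CTP = {p_hat:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```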
While running the experiment, we are making a hypothesis that Variation B will have a higher CTP for our overall population than Variation A. Instead of displaying both pages to all 100,000 visitors, we display them to a sample and observe what happens.
- If Variation A (the original) has a better CTP with our sample of visitors, then no further action needs to be taken, as Variation A is already our permanent page.
- If Variation B has a better CTP, then we need to determine whether the improvement is statistically significant enough to conclude that the change would carry over to the larger population, and thus justify rolling out the new landing page, Variation B.

Moving forward, we need to take care of the different types of errors which arise due to the variability of the data. Here comes the null hypothesis, a baseline assumption that there is no relationship between the two data sets. The null hypothesis in our case is that there is no difference in click-through probability between our control (Variation A) and our experiment (Variation B).
For our A/B test, this means we start by assuming that the new landing page is not going to generate more leads and that the original landing page is the one we should move forward with.
Defining statistical significance

So let's say that Variation B performs better in our sample. How do we know whether or not that improvement will translate to the overall audience? How do we avoid making an error?
The answer is statistical significance.

We establish statistical significance in our A/B test when we can say with 95% certainty that the increase in Variation B's CTP falls outside the expected range of sample variability, where sample variability is the variation in click-through probabilities seen when the same variation is displayed to the same sample population.
Statistical significance is directly related to 2 variables:
- p-value: the calculated probability of finding the observed, or more extreme, results when the null hypothesis (H₀) of the study question is true.
- Significance level: the probability of rejecting the null hypothesis given that it is true. The industry standard is 5%, as mentioned above.

As long as the p-value is less than the significance level, we have a statistically significant result.
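One common way to check this in practice is a two-proportion z-test on pooled data. The sketch below uses made-up counts and simply compares the resulting p-value against the 5% significance level.

```python
import math
from statistics import NormalDist

# Hypothetical results: page views and unique clickers per group.
views_a, clicks_a = 10_000, 300      # control
views_b, clicks_b = 10_000, 360      # experiment

p_a, p_b = clicks_a / views_a, clicks_b / views_b
p_pool = (clicks_a + clicks_b) / (views_a + views_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))

z = (p_b - p_a) / se                             # test statistic under the null hypothesis
p_value = 2 * (1 - NormalDist().cdf(abs(z)))     # two-sided p-value
alpha = 0.05                                     # 5% significance level

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
print("Statistically significant" if p_value < alpha else "Not significant")
```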
4. Design
One of the prerequisites of an A/B testing experiment is that we need to run it in a controlled environment. In practice, this means that, given we control how many page views go into our control and experiment groups, we have to decide how many page views can produce statistically significant results. This is where statistical power comes in: if we see something interesting, we should have enough power to conclude that the interesting thing is statistically significant.
The amount of time required for a reliable test will vary depending on factors like your conversion rates, and how much traffic your website gets; a good testing tool should tell you when you’ve gathered enough data to draw a reliable conclusion.
The required number of page views grows as the effect you want to detect shrinks: the smaller the change you want to detect, or the greater the confidence you wish to have in the results, the larger the test needs to be. Thus, we'll need more page views to have enough statistical power to conclude our results.
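To get a feel for how many page views that implies, here's a rough sample-size sketch based on the standard two-proportion approximation; the baseline CTP, the minimum detectable lift, and the roughly 80% power target are illustrative assumptions.

```python
import math

def sample_size_per_group(p_baseline: float, min_detectable_effect: float,
                          z_alpha: float = 1.96, z_power: float = 0.84) -> int:
    """Approximate page views needed per group to detect a given lift.

    Uses the standard two-proportion approximation, with z_alpha for a
    two-sided 5% significance level and z_power for roughly 80% power.
    """
    p1 = p_baseline
    p2 = p_baseline + min_detectable_effect
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_power) ** 2 * variance) / (min_detectable_effect ** 2)
    return math.ceil(n)

# Hypothetical: a baseline CTP of 3%, and we care about a 0.5 point lift.
print(sample_size_per_group(0.03, 0.005))
```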
5. Analyzing results
Analyzing the results involves doing the statistical math on the data collected. Here are the steps:
- Record the number of page views in the control and experiment groups, say P₁ and P₂, and the number of clicks on each as C₁ and C₂.
- Don't trust your eyes: it might appear that one page has more clicks than the other. Instead, calculate the confidence interval for the difference.
- Calculate the pooled probability: P = (C₁ + C₂) / (P₁ + P₂).
- Calculate the pooled standard error: SE = √(P(1-P)((1/P₁) + (1/P₂))).
- Calculate the estimated difference: diff = experiment probability - control probability.
- Calculate the margin of error for the experiment, which is SE * 1.96 (the z-score for a 95% confidence level).
- The lower bound of the confidence interval is (diff - margin of error) and the upper bound is (diff + margin of error).
- Based on the statistical significance level and the confidence interval, decide whether you should launch the new landing page or not (a sketch of this calculation follows below).
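Here's a minimal sketch of the steps above, with made-up values for P₁, P₂, C₁, and C₂.

```python
import math

# Hypothetical counts: page views (P) and unique clicks (C) per group.
P1, C1 = 10_000, 300    # control
P2, C2 = 10_000, 360    # experiment

p_pool = (C1 + C2) / (P1 + P2)                                  # pooled probability
se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / P1 + 1 / P2))  # pooled standard error
diff = C2 / P2 - C1 / P1                                        # experiment minus control
margin = 1.96 * se_pool                                         # 95% margin of error

lower, upper = diff - margin, diff + margin
print(f"Observed difference: {diff:.4f}")
print(f"95% confidence interval: [{lower:.4f}, {upper:.4f}]")

# If the entire interval lies above zero (and above any minimum lift that
# matters to the business), the result supports launching the new page.
print("Launch candidate" if lower > 0 else "Not conclusive")
```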
Conclusion

If you have a website, you have actions that you want your users to complete (e.g., make a purchase, sign up for a newsletter) and/or metrics that you want to improve (e.g., revenue, session duration, bounce rate). With A/B testing, you can test which version of a landing page results in the greatest improvement in conversions (i.e., completed activities that you measure as goals) or metric value.

Here's a Google guide to configuring an experiment: Configure & Modify Experiments (support.google.com).

Essentially, the sample size, the time window for the test, and the metrics must be decided on in advance. Statistical significance should not be the sole criterion for stopping the test, or your results might be meaningless: the results will keep oscillating between being statistically significant and not significant. Significance should become a point of discussion only when the test reaches its endpoint.
The return on investment of A/B testing can be massive, as even small changes on a landing page or website can result in significant increases in leads generated, sales, and revenue. All you need to do is be careful while designing the experiment and choosing the right metric.
Now that we’re all on the same page* regarding what we can achieve using A/B testing, it’s time to test your innovative ideas and see them flourish.
T&C* — If you’re not on the same page, please drop a comment below and I’ll be happy to answer you!