The Why. Why do we want to change anything? Company wants to improve the student experience and to let instructors focus their support on students who are likely to complete the free trial and make at least one payment.
The goal is to let potential students pre-select themselves before committing to the free trial so that trial cancellation rates decrease. If the treatment works, fewer uncommitted students will participate in the trial, which saves time and effort for the instructors and, therefore, money for Company. In this study we are not concerned with what happens to students who proceed beyond their initial payment.
Website visitors have two choices:
“Start free trial” (14 days for free, then automatically charged a fee if not canceled)
“Review course materials” (study on own without instructors)
Visitors in the experiment group see a popup asking “Check that you’re ready” after they click the “Start free trial” button. This is the “screener question”: it asks the visitor to enter the number of hours they are willing to commit to the course. If the visitor enters fewer than 5 hours, a warning message is displayed suggesting that they reconsider enrolling in the free trial and instead “access course materials” at their own pace.
The experiences of the visitors in the two groups are otherwise identical. The only things that can affect a visitor’s behavior in the experiment group are the “screener question” and the warning message.
The test designers hypothesize that, among visitors who see the warning message, a smaller proportion will enroll in the free trial because some will take the advice and decide not to enroll. Because of this self-selection, the enrollment rate is expected to be lower among visitors who saw the warning message than among those who did not. This should also translate into a higher rate of free-trial completion and, therefore, of paid enrollment (at least one payment) in the course.
Unregistered visitors: unique (for the day) cookie visits to the course page
Registered visitors: user-id
Pageviews, Clicks, Enrollments, Payments (for more detail see Metric Choice section below)
Daily data
Note that the Payment data was collected not on the date shown in the table but 14 days later, at the end of the trial. Because of this, Enrollment and Payment figures are unavailable for the last 14 days of the experiment (they appear as NAs in the summaries below).
A person visits the course overview page and is tracked by a cookie.
The visitor can do one of three things: leave the website (the majority of cases), click on “access course materials,” or click on “start free trial.” We are interested only in the last option.
When the visitor clicks “start free trial,” they may or may not see the question (“Check that you’re ready”) - this is the point of diversion into the control and experiment groups.
These variables either are already available in the data or can be calculated for both Control and Experiment groups. Note that we do not have a variable specifically for free trial completion - we use the Payments variable in its place.
Initial List: Gross conversion, Net conversion, Retention
The main evaluation metric is the Gross Conversion (GC) rate because our main goal is to reduce the number of enrollments by less committed students; we expect the warning message to discourage some of them from enrolling. To avoid scale differences we calculate GC from the data of both the experiment and control groups, and we expect it to be lower in the experiment group. We will test that the difference in GC between the two groups exceeds the practical significance threshold of 1% (an absolute difference of one percentage point) and is statistically significant at alpha = 0.05.
Retention could be impacted. If the treatment does not affect committed students who go on to make at least one payment, the retention rate is expected to increase because enrollments (the denominator) decrease while the number of payments stays the same.
Net Conversion could also be impacted. It is conceivable that some of those discouraged by the screener from enrolling could have completed the course. Hopefully, however, this metric will not be significantly impacted and the number of students who continue past the trial and complete the course will remain the same.
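For reference, the three candidate metrics are computed from the raw counts as follows (a minimal sketch; the function names are mine and the actual analysis code may differ):

```r
# Definitions of the candidate evaluation metrics (sketch):
#   Gross Conversion = enrollments / clicks on "Start free trial"
#   Net Conversion   = payments    / clicks on "Start free trial"
#   Retention        = payments    / enrollments
gross_conversion <- function(enrollments, clicks)     enrollments / clicks
net_conversion   <- function(payments, clicks)        payments / clicks
retention        <- function(payments, enrollments)   payments / enrollments
```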
Cookies (page views), Clicks, Click-through Probability
The unit of diversion is a cookie, so I expect to see about the same number of page views and clicks and, therefore, the same click-through probability in the two groups. It is only after visitors click on the button that their experiences start to diverge.
The unit of diversion in our study is a cookie; therefore, the metrics whose denominator is counted in cookies (clicks by unique cookies) will have similar analytical and empirical estimates of the variability. This is the case for the Gross and Net Conversion rates but not for Retention, whose denominator is enrollments. For Retention the analytical estimate is likely to underestimate the variability, and an empirical estimate would lead to a more accurate analysis. Using the baseline values provided and a sample size of 5,000 pageviews, I calculated the variability analytically to be:
## [1] "Standard errors for Gross Conversion, Net Conversion, and Retention respectively"
## [1] "0.0202 0.0156 0.0549"
Standard Error Gross Conversion rate: 0.0202
Standard Error Net Conversion rate: 0.0156
Standard Error Retention rate: 0.0549
(Note that the standard error is the standard deviation of the sampling distribution of the mean, and the mean in these cases is a probability.)
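A minimal sketch of that analytical calculation, using the baseline rates quoted in the sample-size section below (click-through 0.08, gross conversion 0.20625, retention 0.53) and a sample of 5,000 pageviews:

```r
# Analytical standard errors, SE = sqrt(p * (1 - p) / n), for a sample
# of 5,000 pageviews funneled through the baseline rates.
pageviews   <- 5000
clicks      <- pageviews * 0.08        # baseline click-through probability
enrollments <- clicks * 0.20625        # baseline gross conversion
p_nc        <- 0.20625 * 0.53          # baseline net conversion = 0.1093125

se <- function(p, n) sqrt(p * (1 - p) / n)

round(se(0.20625, clicks), 4)          # Gross Conversion: 0.0202
round(se(p_nc, clicks), 4)             # Net Conversion:   0.0156
round(se(0.53, enrollments), 4)        # Retention:        0.0549
```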
Number of Samples given the desired Confidence and Power.
We need to achieve confidence of 95% (alpha = 0.05) and power of 80% (beta = 0.2) in our test.
Since the main metric is the Gross Conversion rate and the other evaluation metrics are correlated with it, I will not be using the Bonferroni correction here. The Bonferroni correction applies when we test multiple independent metrics and make our decision if at least one of them is significant.
Gross Conversion. The baseline GC rate is 20.625% and the smallest difference we are interested in detecting is 1%. Plugging these two numbers into the calculator gives a required size of 25,835 clicks per group. Since the denominator is clicks, I need to scale this by the click-through probability (0.08) to get the actual number of required page views. Total number of page views required: (25,835 * 2) / 0.08 = 645,875.
Net Conversion. Similarly for Net Conversion - plug in 10.93125% and 0.0075 to get 27,413 clicks in each sample. Total number of page views required: (27,413 * 2) / 0.08 = 685,325.
Retention. For Retention - plug in 53% and 1% (you can already guess the result will be higher than for the metrics above) to get 39,087 enrollments, which translates to (39,087 * 2) / (0.08 * 0.20625) = 4,737,818 pageviews. Notice how the requirements for GC and NC are fairly reasonable but the one for Retention is not. Knowing this, I would recommend not using Retention as an evaluation metric and focusing on the Gross and Net Conversion rates.
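The per-group sizes above come from an online sample-size calculator. As a rough cross-check, base R’s power.prop.test gives figures in the same ballpark (it uses a slightly different normal approximation, so the numbers do not match exactly):

```r
# Approximate per-group sample sizes at alpha = 0.05 and power = 0.8.
# These land close to, but not exactly at, the calculator values quoted above.
n_gc <- power.prop.test(p1 = 0.20625,   p2 = 0.20625 - 0.01,
                        sig.level = 0.05, power = 0.8)$n   # clicks per group
n_nc <- power.prop.test(p1 = 0.1093125, p2 = 0.1093125 - 0.0075,
                        sig.level = 0.05, power = 0.8)$n   # clicks per group
n_re <- power.prop.test(p1 = 0.53,      p2 = 0.53 - 0.01,
                        sig.level = 0.05, power = 0.8)$n   # enrollments per group

# Convert to total pageviews for both groups combined:
ceiling(n_gc * 2 / 0.08)               # pageviews for Gross Conversion
ceiling(n_nc * 2 / 0.08)               # pageviews for Net Conversion
ceiling(n_re * 2 / (0.08 * 0.20625))   # pageviews for Retention (by far the largest)
```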
The Net Conversion rate has the slightly higher requirement of 685,325 page views, so it is our limiting factor. Since the average daily number of unique page views is 40,000, even if we used all of the traffic for the test and control groups we would need 18 days to reach this volume.
It is, however, unwise to expose 50% of the traffic to the treatment. Yes, the experiment itself is low risk, but we have to remember that any time our engineers modify code a new bug might be introduced. I would therefore recommend randomly exposing only 20% of the traffic to the treatment and assigning another 20% of the traffic to the control group. This way, if anything goes wrong, only 20% of the traffic is at risk. That said, this means only 8,000 page views per day would be exposed to the treatment and the duration of the experiment would be 43 days.
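The duration arithmetic from the last two paragraphs, spelled out:

```r
required_pageviews <- 685325                 # limiting requirement (Net Conversion)
daily_pageviews    <- 40000                  # average unique pageviews per day

ceiling(required_pageviews / daily_pageviews)          # 18 days using 100% of traffic
ceiling(required_pageviews / (daily_pageviews * 0.4))  # 43 days using 20% + 20% of traffic
```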
## [1] "Summary Statistics for the Control Data:"
## date pageviews clicks enrollments
## Length:37 Min. : 7434 Min. :632 Min. :110.0
## Class :character 1st Qu.: 8896 1st Qu.:708 1st Qu.:146.5
## Mode :character Median : 9420 Median :759 Median :162.0
## Mean : 9339 Mean :767 Mean :164.6
## 3rd Qu.: 9871 3rd Qu.:825 3rd Qu.:175.0
## Max. :10667 Max. :909 Max. :233.0
## NA's :14
## payments ctp gc re
## Min. : 56.00 Min. :0.07134 Min. :0.1677 Min. :0.3252
## 1st Qu.: 70.00 1st Qu.:0.07960 1st Qu.:0.1903 1st Qu.:0.4671
## Median : 91.00 Median :0.08291 Median :0.1952 Median :0.5545
## Mean : 88.39 Mean :0.08213 Mean :0.2204 Mean :0.5398
## 3rd Qu.:102.50 3rd Qu.:0.08412 3rd Qu.:0.2378 3rd Qu.:0.5926
## Max. :128.00 Max. :0.08896 Max. :0.3269 Max. :0.7273
## NA's :14 NA's :14 NA's :14
## nc
## Min. :0.07646
## 1st Qu.:0.10062
## Median :0.11076
## Mean :0.11827
## 3rd Qu.:0.13108
## Max. :0.18524
## NA's :14
## [1] "Summary Statistics for the Experimental Data:"
## date pageviews clicks enrollments
## Length:37 Min. : 7664 Min. :642.0 Min. : 94.0
## Class :character 1st Qu.: 8881 1st Qu.:722.0 1st Qu.:127.0
## Mode :character Median : 9359 Median :770.0 Median :142.0
## Mean : 9315 Mean :765.5 Mean :148.8
## 3rd Qu.: 9737 3rd Qu.:827.0 3rd Qu.:172.0
## Max. :10551 Max. :884.0 Max. :213.0
## NA's :14
## payments ctp gc re
## Min. : 34.00 Min. :0.07413 Min. :0.1442 Min. :0.3237
## 1st Qu.: 69.00 1st Qu.:0.07985 1st Qu.:0.1639 1st Qu.:0.4905
## Median : 91.00 Median :0.08272 Median :0.1779 Median :0.5659
## Mean : 84.57 Mean :0.08219 Mean :0.1996 Mean :0.5731
## 3rd Qu.: 99.00 3rd Qu.:0.08435 3rd Qu.:0.2361 3rd Qu.:0.6634
## Max. :123.00 Max. :0.08891 Max. :0.2843 Max. :0.7845
## NA's :14 NA's :14 NA's :14
## nc
## Min. :0.04956
## 1st Qu.:0.09070
## Median :0.11298
## Mean :0.11337
## 3rd Qu.:0.13856
## Max. :0.17036
## NA's :14
Observations. The first thing I notice is that each group averaged about 9,000 pageviews per day. Since the total traffic is 40,000 unique cookies per day, it looks like the designers dedicated roughly 50% of the traffic to this experiment, while my recommendation was only 40%.
My invariant metrics are Page views (Cookies), Clicks and Click-through probability.
I test Pageviews and Clicks with my own “f_binom” function and the Click-through probability with the “f_prop” function.
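A minimal sketch of the idea behind the binomial sanity check (my actual “f_binom” returns the richer list shown below, but the core is a confidence interval around an expected 50/50 split):

```r
# Sanity check: is the count assigned to one group consistent with a
# fair 50/50 split of the total? (Sketch only; f_binom returns more detail.)
sanity_binom <- function(successes, total, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(0.5 * 0.5 / total)          # SE of the proportion under a fair split
  ci <- 0.5 + c(-1, 1) * z * se          # CI around the expected 0.5
  observed <- successes / total
  list(observed = observed,
       ci = round(ci, 4),
       reject = observed < ci[1] | observed > ci[2])
}
```

For example, `sanity_binom(345543, 690203)` reproduces the proportion interval (0.4988, 0.5012) reported for pageviews below.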
## [1] "Sanity check on page views"
## $observed.number.successes
## [1] 345543
##
## $prop.successes
## [1] 0.5006397
##
## $ci_num
## [1] "344287 345916"
##
## $ci_prop
## [1] 0.4988 0.5012
##
## $outside.conf.int.reject.null
## [1] "FALSE FALSE"
## [1] "Good, the observed number is not outside the confidence interval - sanity check passed, our test implementation and data collection seems OK."
## [1] "Sanity check on clicks"
## $observed.number.successes
## [1] 28378
##
## $prop.successes
## [1] 0.5004673
##
## $ci_num
## [1] "28118 28585"
##
## $ci_prop
## [1] 0.4959 0.5041
##
## $outside.conf.int.reject.null
## [1] "FALSE FALSE"
## [1] "Same here - sanity check on clicks passed. Since both pageviews and clicks are equal across the groups, the click-through probability has to be equal as well. But we can check the proportions as well with a similiar test."
## [1] "Sanity check on Click-through probability"
## $observed.pro_diff
## [1] 1e-04
##
## $pooled.pro
## [1] 0.0822
##
## $pooled.se
## [1] 7e-04
##
## $ci_pro.null.0
## [1] -0.0012 0.0014
##
## $outside.conf.int.reject.null
## [1] FALSE
## [1] "OK. The difference in proportions is not outside the confidence interval."
All sanity checks passed. The fact that the number of pageviews, clicks and, therefore, the click-through probability do not differ significantly between the two groups suggests that the cookies were assigned to the groups at random.
Also, above I noted that about 50% of the traffic was dedicated to this AB test. But was this portion of the traffic representative of the whole population? It is important to confirm this first before we make generalizations about the whole population of our visitors.
I will test this by comparing the baseline click-through probability to the observed values.
## $observed.pro_diff
## [1] 0.0022
##
## $pooled.pro
## [1] 0.082
##
## $pooled.se
## [1] 0.0014
##
## $ci_pro.null.0
## [1] -0.0006 0.0049
##
## $outside.conf.int.reject.null
## [1] FALSE
Yes, the combination of the AB samples looks similar to the whole population in terms of the click-through probability, so it is safe to generalize the conclusions from this test to the whole population of our visitors.
The evaluation metrics are Gross Conversion, Net Conversion and Retention.
## [1] "Enrollments for Test and Control groups: 3423 3785"
## [1] "Clicks for Test and Control groups: 17260 17293"
## [1] "Gross Conversion for Test and Control groups: 0.1983 0.2189"
## [1] "Analysis of Gross Conversion = Enrollments / Clicks"
## $observed.pro_diff
## [1] -0.0206
##
## $pooled.pro
## [1] 0.2086
##
## $pooled.se
## [1] 0.0044
##
## $ci_pro.null.0
## [1] -0.0291 -0.0120
##
## $outside.conf.int.reject.null
## [1] TRUE
## [1] "The GC is 2% (percentage points) lower for the experimental group while the d_min=1%. Since the confidence interval does not include zero (or we can also say that observed difference is outside of margin of error), we can reject the null hypothesis at 95% confidence level. This means that our result is both statistically and practically significant - our experiment succeeded in reducing the number of enrollments."
## [1] "Sign test for Gross Conversion rate should confirm our result."
##
## Exact binomial test
##
## data: successgc and total
## number of successes = 19, number of trials = 23, p-value = 0.0013
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.6450676 1.0000000
## sample estimates:
## probability of success
## 0.826087
## [1] "The odds that we'd get 19 out of 23 successes are pretty low (p-value = 0.0013) this confirms our treatment worked."
## [1] "Similiar test for Net Conversion = Payments / Clicks"
## $observed.pro_diff
## [1] -0.0049
##
## $pooled.pro
## [1] 0.1151
##
## $pooled.se
## [1] 0.0034
##
## $ci_pro.null.0
## [1] -0.0116 0.0019
##
## $outside.conf.int.reject.null
## [1] FALSE
## [1] "Not significant (p-value = 0.3388) - suggests that our 'screener' did not really discourage potentially paying students. Good, that means we did not lose any real business while we discouraged some reasonable people from enrolling after they saw our 'screener.'"
## [1] "Sign test for Net Conversion rate"
##
## Exact binomial test
##
## data: successnc and total
## number of successes = 13, number of trials = 23, p-value = 0.3388
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.3753936 1.0000000
## sample estimates:
## probability of success
## 0.5652174
## [1] "Not significant."
## [1] "Finally, for the Retention rate = Payments / Enrollments"
## $observed.pro_diff
## [1] 0.0311
##
## $pooled.pro
## [1] 0.5519
##
## $pooled.se
## [1] 0.0117
##
## $ci_pro.null.0
## [1] 0.0081 0.0541
##
## $outside.conf.int.reject.null
## [1] TRUE
## [1] "Retention significant as well - meaning a higher proportion of those enrolled became paying students, which is what we want."
## [1] "Interesting, from the results of power analysis I concluded that we would not be able to detect the minimum absolute difference of 1% for the retention rate. Note, however, that in our experiment we got a 3% difference and it was found to be significant. So, I re-ran the power analysis for 3% and found the necessary number of pageviews = (4,338 * 2) * 0.08000 * 0.20625 = 525,818. This is actually less than the total number of pageviews in our two samples - 690,203, so it makes sense now why even with our samples we detected a significant difference of about 3%."
## [1] "Sign test for Retention rate"
##
## Exact binomial test
##
## data: successre and total
## number of successes = 13, number of trials = 23, p-value = 0.3388
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.3753936 1.0000000
## sample estimates:
## probability of success
## 0.5652174
## [1] "But the sign test is not significant (p-value = 0.3388). Oh well, this is still a bonus as I did not expect to able to measure retention at all."
I did not use the Bonferroni correction because it only applies when we test multiple independent metrics and make our decision if at least one metric is significant. In our case the only metric I was interested in was the Gross Conversion rate. The effect-size hypothesis test and the sign test concurred (sign test p-value of 0.0013). The difference in GC between the experiment and control groups was both statistically and practically significant. I tested Net Conversion and Retention as well, but their results do not influence my recommendation.
Since the treatment succeeded in decreasing the Gross Conversion rate, I would recommend launching the change (the “screener”) in production. There is good evidence that the “screener” helped to exclude some of the students who would have canceled the trial. Additionally, the Net Conversion rate was not significantly impacted and Retention actually increased, suggesting that the “screener” did not deter committed students who became paying customers.
Company could conduct a wider and longer test like this one, or raise the pre-selection bar even further. For example, we could increase the required commitment from 5 to 10 hours, or introduce other barriers such as an assessment survey or a quiz that would discourage uncommitted students from enrolling in the trial.