The Why. Why do we want to change anything? Company wants to improve the student experience and to let instructors focus their support on students who are likely to complete the free trial and make at least one payment.
The goal is to let potential students pre-select themselves before committing to the free trial so that trial cancellation rates decrease. If the treatment works, fewer uncommitted students will participate in the trial, which saves time and effort for the instructors and, therefore, money for Company. In this study we are not concerned with what happens to students who proceed beyond their initial payment.
Website visitors have two choices:
“Start free trial” (14 days for free, then automatically charged a fee if not canceled)
“Review course materials” (study on own without instructors)
Visitors in the experiment group see a popup asking “Check that you’re ready” after they click the “Start free trial” button. This is the “screener question”: it asks the visitor to enter the number of hours they are willing to commit to the course. If the visitor enters fewer than 5 hours, a warning message is displayed suggesting that they reconsider enrolling in the free trial and instead “access course materials” at their own pace.
The experiences of the visitors in the two groups are otherwise identical. The only things that can affect a visitor’s behavior in the experiment group are the “screener question” and the warning message.
The test designers hypothesize that, among visitors who see the warning message, a smaller proportion will enroll in the free trial because some will take the advice and decide not to enroll. Because of this self-selection, the enrollment rate is expected to be lower among visitors who saw the warning message than among those who did not. This should also translate into a higher rate of free-trial completion and, therefore, of paid enrollment (at least one payment) in the course.
Unregistered visitors: unique (for the day) cookie visits to the course page
Registered visitors: user-id
Pageviews, Clicks, Enrollments, Payments (for more detail see Metric Choice section below)
Daily data
Note that the Payment data was collected not on the date shown in the table but 14 days later, at the end of the trial. Because of this, Enrollment and Payment figures are unavailable for the last 14 days of the experiment (they appear as NAs in the summaries below).
A person visits the course overview page and is tracked by a cookie.
The visitor can do one of three things: leave the website (the majority of cases), click on “access course materials,” or click on “start free trial.” We are interested only in the last option.
When the visitor clicks “start free trial,” they may or may not see the question (“Check that you’re ready”) - this is the point of diversion into the control and experiment groups.
These variables either are already available in the data or can be calculated for both Control and Experiment groups. Note that we do not have a variable specifically for free trial completion - we use the Payments variable in its place.
Initial List: Gross conversion, Net conversion, Retention
The main evaluation metric is the Gross Conversion (GC) rate because our main goal is to reduce the number of enrollments by less committed students; we expect the warning message to discourage some of them from enrolling. To avoid scale differences we calculate GC from the data of both the experiment and control groups, and we expect it to be lower in the experiment group. We will test that the difference in GC between the two groups exceeds the practical significance threshold of 1% (an absolute difference of one percentage point) and is statistically significant at alpha = 0.05.
Retention could be impacted. If the treatment does not affect committed students who go on to make at least one payment, the retention rate is expected to increase because enrollments (the denominator) decrease while the number of payments stays the same.
Net Conversion could also be impacted. It is conceivable that some of those discouraged by the screener from enrolling could have completed the course. Hopefully, however, this metric will not be significantly impacted and the number of students who continue past the trial and complete the course will remain the same.
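For reference, the three candidate metrics are computed from the raw counts as follows (a minimal sketch; the function names are mine and the actual analysis code may differ):

```r
# Definitions of the candidate evaluation metrics (sketch):
#   Gross Conversion = enrollments / clicks on "Start free trial"
#   Net Conversion   = payments    / clicks on "Start free trial"
#   Retention        = payments    / enrollments
gross_conversion <- function(enrollments, clicks)     enrollments / clicks
net_conversion   <- function(payments, clicks)        payments / clicks
retention        <- function(payments, enrollments)   payments / enrollments
```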
Cookies (page views), Clicks, Click-through Probability
The unit of diversion is a cookie, so I expect to see about the same number of page views and clicks and, therefore, the same click-through probability in the two groups. It is only after visitors click on the button that their experiences start to diverge.
The unit of diversion in our study is a cookie; therefore, the metrics whose denominator is counted in cookies (clicks by unique cookies) will have similar analytical and empirical estimates of the variability. This is the case for the Gross and Net Conversion rates but not for Retention, whose denominator is enrollments. For Retention the analytical estimate is likely to underestimate the variability, and an empirical estimate would lead to a more accurate analysis. Using the baseline values provided and a sample size of 5,000 pageviews, I calculated the variability analytically to be:
## [1] "Standard errors for Gross Conversion, Net Conversion, and Retention respectively"
## [1] "0.0202 0.0156 0.0549"
Standard Error Gross Conversion rate: 0.0202
Standard Error Net Conversion rate: 0.0156
Standard Error Retention rate: 0.0549
(Note that the standard error is the standard deviation of the sampling distribution of the mean, and the mean in these cases is a probability.)
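A minimal sketch of that analytical calculation, using the baseline rates quoted in the sample-size section below (click-through 0.08, gross conversion 0.20625, retention 0.53) and a sample of 5,000 pageviews:

```r
# Analytical standard errors, SE = sqrt(p * (1 - p) / n), for a sample
# of 5,000 pageviews funneled through the baseline rates.
pageviews   <- 5000
clicks      <- pageviews * 0.08        # baseline click-through probability
enrollments <- clicks * 0.20625        # baseline gross conversion
p_nc        <- 0.20625 * 0.53          # baseline net conversion = 0.1093125

se <- function(p, n) sqrt(p * (1 - p) / n)

round(se(0.20625, clicks), 4)          # Gross Conversion: 0.0202
round(se(p_nc, clicks), 4)             # Net Conversion:   0.0156
round(se(0.53, enrollments), 4)        # Retention:        0.0549
```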
Number of Samples given the desired Confidence and Power.
We need to achieve confidence of 95% (alpha = 0.05) and power of 80% (beta = 0.2) in our test.
Since the main metric is the Gross Conversion rate and the other evaluation metrics are correlated with it, I will not be using the Bonferroni correction here. The Bonferroni correction applies when we test multiple independent metrics and make our decision if at least one of them is significant.
Gross Conversion. The baseline GC rate is 20.625% and the smallest difference we are interested in detecting is 1%. Plugging these two numbers into the calculator gives a required size of 25,835 clicks per group. Since the denominator is clicks, I need to scale this by the click-through probability (0.08) to get the actual number of required page views. Total number of page views required: (25,835 * 2) / 0.08 = 645,875.
Net Conversion. Similarly for Net Conversion - plug in 10.93125% and 0.0075 to get 27,413 clicks in each sample. Total number of page views required: (27,413 * 2) / 0.08 = 685,325.
Retention. For Retention - plug in 53% and 1% (you can already guess the result will be higher than for the metrics above) to get 39,087 enrollments, which translates to (39,087 * 2) / (0.08 * 0.20625) = 4,737,818 pageviews. Notice how the requirements for GC and NC are fairly reasonable but the one for Retention is not. Knowing this, I would recommend not using Retention as an evaluation metric and focusing on the Gross and Net Conversion rates.
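The per-group sizes above come from an online sample-size calculator. As a rough cross-check, base R’s power.prop.test gives figures in the same ballpark (it uses a slightly different normal approximation, so the numbers do not match exactly):

```r
# Approximate per-group sample sizes at alpha = 0.05 and power = 0.8.
# These land close to, but not exactly at, the calculator values quoted above.
n_gc <- power.prop.test(p1 = 0.20625,   p2 = 0.20625 - 0.01,
                        sig.level = 0.05, power = 0.8)$n   # clicks per group
n_nc <- power.prop.test(p1 = 0.1093125, p2 = 0.1093125 - 0.0075,
                        sig.level = 0.05, power = 0.8)$n   # clicks per group
n_re <- power.prop.test(p1 = 0.53,      p2 = 0.53 - 0.01,
                        sig.level = 0.05, power = 0.8)$n   # enrollments per group

# Convert to total pageviews for both groups combined:
ceiling(n_gc * 2 / 0.08)               # pageviews for Gross Conversion
ceiling(n_nc * 2 / 0.08)               # pageviews for Net Conversion
ceiling(n_re * 2 / (0.08 * 0.20625))   # pageviews for Retention (by far the largest)
```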
The Net Conversion rate has the slightly higher requirement of 685,325 page views, so it is our limiting factor. Since the average daily number of unique page views is 40,000, even if we used all of the traffic for the test and control groups we would need 18 days to reach this volume.
It is, however, unwise to expose 50% of the traffic to the treatment. Yes, the experiment itself is low risk, but we have to remember that any time our engineers modify code a new bug might be introduced. I would therefore recommend randomly exposing only 20% of the traffic to the treatment and assigning another 20% of the traffic to the control group. This way, if anything goes wrong, only 20% of the traffic is at risk. That said, this means only 8,000 page views per day would be exposed to the treatment and the duration of the experiment would be 43 days.
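The duration arithmetic from the last two paragraphs, spelled out:

```r
required_pageviews <- 685325                 # limiting requirement (Net Conversion)
daily_pageviews    <- 40000                  # average unique pageviews per day

ceiling(required_pageviews / daily_pageviews)          # 18 days using 100% of traffic
ceiling(required_pageviews / (daily_pageviews * 0.4))  # 43 days using 20% + 20% of traffic
```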
## [1] "Summary Statistics for the Control Data:"
## date pageviews clicks enrollments
## Length:37 Min. : 7434 Min. :632 Min. :110.0
## Class :character 1st Qu.: 8896 1st Qu.:708 1st Qu.:146.5
## Mode :character Median : 9420 Median :759 Median :162.0
## Mean : 9339 Mean :767 Mean :164.6
## 3rd Qu.: 9871 3rd Qu.:825 3rd Qu.:175.0
## Max. :10667 Max. :909 Max. :233.0
## NA's :14
## payments ctp gc re
## Min. : 56.00 Min. :0.07134 Min. :0.1677 Min. :0.3252
## 1st Qu.: 70.00 1st Qu.:0.07960 1st Qu.:0.1903 1st Qu.:0.4671
## Median : 91.00 Median :0.08291 Median :0.1952 Median :0.5545
## Mean : 88.39 Mean :0.08213 Mean :0.2204 Mean :0.5398
## 3rd Qu.:102.50 3rd Qu.:0.08412 3rd Qu.:0.2378 3rd Qu.:0.5926
## Max. :128.00 Max. :0.08896 Max. :0.3269 Max. :0.7273
## NA's :14 NA's :14 NA's :14
## nc
## Min. :0.07646
## 1st Qu.:0.10062
## Median :0.11076
## Mean :0.11827
## 3rd Qu.:0.13108
## Max. :0.18524
## NA's :14
## [1] "Summary Statistics for the Experimental Data:"
## date pageviews clicks enrollments
## Length:37 Min. : 7664 Min. :642.0 Min. : 94.0
## Class :character 1st Qu.: 8881 1st Qu.:722.0 1st Qu.:127.0
## Mode :character Median : 9359 Median :770.0 Median :142.0
## Mean : 9315 Mean :765.5 Mean :148.8
## 3rd Qu.: 9737 3rd Qu.:827.0 3rd Qu.:172.0
## Max. :10551 Max. :884.0 Max. :213.0
## NA's :14
## payments ctp gc re
## Min. : 34.00 Min. :0.07413 Min. :0.1442 Min. :0.3237
## 1st Qu.: 69.00 1st Qu.:0.07985 1st Qu.:0.1639 1st Qu.:0.4905
## Median : 91.00 Median :0.08272 Median :0.1779 Median :0.5659
## Mean : 84.57 Mean :0.08219 Mean :0.1996 Mean :0.5731
## 3rd Qu.: 99.00 3rd Qu.:0.08435 3rd Qu.:0.2361 3rd Qu.:0.6634
## Max. :123.00 Max. :0.08891 Max. :0.2843 Max. :0.7845
## NA's :14 NA's :14 NA's :14
## nc
## Min. :0.04956
## 1st Qu.:0.09070
## Median :0.11298
## Mean :0.11337
## 3rd Qu.:0.13856
## Max. :0.17036
## NA's :14
Observations. The first thing I notice is that each group averaged about 9,000 pageviews per day. Since the total traffic is 40,000 unique cookies per day, it looks like the designers dedicated roughly 50% of the traffic to this experiment, while my recommendation was only 40%.
My invariant metrics are Page views (Cookies), Clicks and Click-through probability.
I test Pageviews and Clicks with my own “f_binom” function and the Click-through probability with the “f_prop” function.
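A minimal sketch of the idea behind the binomial sanity check (my actual “f_binom” returns the richer list shown below, but the core is a confidence interval around an expected 50/50 split):

```r
# Sanity check: is the count assigned to one group consistent with a
# fair 50/50 split of the total? (Sketch only; f_binom returns more detail.)
sanity_binom <- function(successes, total, conf = 0.95) {
  z  <- qnorm(1 - (1 - conf) / 2)
  se <- sqrt(0.5 * 0.5 / total)          # SE of the proportion under a fair split
  ci <- 0.5 + c(-1, 1) * z * se          # CI around the expected 0.5
  observed <- successes / total
  list(observed = observed,
       ci = round(ci, 4),
       reject = observed < ci[1] | observed > ci[2])
}
```

For example, `sanity_binom(345543, 690203)` reproduces the proportion interval (0.4988, 0.5012) reported for pageviews below.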
## [1] "Sanity check on page views"
## $observed.number.successes
## [1] 345543
##
## $prop.successes
## [1] 0.5006397
##
## $ci_num
## [1] "344287 345916"
##
## $ci_prop
## [1] 0.4988 0.5012
##
## $outside.conf.int.reject.null
## [1] "FALSE FALSE"
## [1] "Good, the observed number is not outside the confidence interval - sanity check passed, our test implementation and data collection seems OK."
## [1] "Sanity check on clicks"
## $observed.number.successes
## [1] 28378
##
## $prop.successes
## [1] 0.5004673
##
## $ci_num
## [1] "28118 28585"
##
## $ci_prop
## [1] 0.4959 0.5041
##
## $outside.conf.int.reject.null
## [1] "FALSE FALSE"
## [1] "Same here - sanity check on clicks passed. Since both pageviews and clicks are equal across the groups, the click-through probability has to be equal as well. But we can check the proportions as well with a similiar test."
## [1] "Sanity check on Click-through probability"
## $observed.pro_diff
## [1] 1e-04
##
## $pooled.pro
## [1] 0.0822
##
## $pooled.se
## [1] 7e-04
##
## $ci_pro.null.0
## [1] -0.0012 0.0014
##
## $outside.conf.int.reject.null
## [1] FALSE
## [1] "OK. The difference in proportions is not outside the confidence interval."
All sanity checks passed. The fact that the number of pageviews, clicks and, therefore, the click-through probability do not differ significantly between the two groups suggests that the cookies were assigned to the groups at random.
Also, above I noted that about 50% of the traffic was dedicated to this AB test. But was this portion of the traffic representative of the whole population? It is important to confirm this first before we make generalizations about the whole population of our visitors.
I will test this by comparing the baseline click-through probability to the observed values.
## $observed.pro_diff
## [1] 0.0022
##
## $pooled.pro
## [1] 0.082
##
## $pooled.se
## [1] 0.0014
##
## $ci_pro.null.0
## [1] -0.0006 0.0049
##
## $outside.conf.int.reject.null
## [1] FALSE
Yes, the combination of the AB samples looks similar to the whole population in terms of the click-through probability, so it is safe to generalize the conclusions from this test to the whole population of our visitors.
The evaluation metrics are Gross Conversion, Net Conversion and Retention.
## [1] "Enrollments for Test and Control groups: 3423 3785"
## [1] "Clicks for Test and Control groups: 17260 17293"
## [1] "Gross Conversion for Test and Control groups: 0.1983 0.2189"
## [1] "Analysis of Gross Conversion = Enrollments / Clicks"
## $observed.pro_diff
## [1] -0.0206
##
## $pooled.pro
## [1] 0.2086
##
## $pooled.se
## [1] 0.0044
##
## $ci_pro.null.0
## [1] -0.0291 -0.0120
##
## $outside.conf.int.reject.null
## [1] TRUE
## [1] "The GC is 2% (percentage points) lower for the experimental group while the d_min=1%. Since the confidence interval does not include zero (or we can also say that observed difference is outside of margin of error), we can reject the null hypothesis at 95% confidence level. This means that our result is both statistically and practically significant - our experiment succeeded in reducing the number of enrollments."
## [1] "Sign test for Gross Conversion rate should confirm our result."
##
## Exact binomial test
##
## data: successgc and total
## number of successes = 19, number of trials = 23, p-value = 0.0013
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.6450676 1.0000000
## sample estimates:
## probability of success
## 0.826087
## [1] "The odds that we'd get 19 out of 23 successes are pretty low (p-value = 0.0013) this confirms our treatment worked."
## [1] "Similiar test for Net Conversion = Payments / Clicks"
## $observed.pro_diff
## [1] -0.0049
##
## $pooled.pro
## [1] 0.1151
##
## $pooled.se
## [1] 0.0034
##
## $ci_pro.null.0
## [1] -0.0116 0.0019
##
## $outside.conf.int.reject.null
## [1] FALSE
## [1] "Not significant (p-value = 0.3388) - suggests that our 'screener' did not really discourage potentially paying students. Good, that means we did not lose any real business while we discouraged some reasonable people from enrolling after they saw our 'screener.'"
## [1] "Sign test for Net Conversion rate"
##
## Exact binomial test
##
## data: successnc and total
## number of successes = 13, number of trials = 23, p-value = 0.3388
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.3753936 1.0000000
## sample estimates:
## probability of success
## 0.5652174
## [1] "Not significant."
## [1] "Finally, for the Retention rate = Payments / Enrollments"
## $observed.pro_diff
## [1] 0.0311
##
## $pooled.pro
## [1] 0.5519
##
## $pooled.se
## [1] 0.0117
##
## $ci_pro.null.0
## [1] 0.0081 0.0541
##
## $outside.conf.int.reject.null
## [1] TRUE
## [1] "Retention significant as well - meaning a higher proportion of those enrolled became paying students, which is what we want."
## [1] "Interesting, from the results of power analysis I concluded that we would not be able to detect the minimum absolute difference of 1% for the retention rate. Note, however, that in our experiment we got a 3% difference and it was found to be significant. So, I re-ran the power analysis for 3% and found the necessary number of pageviews = (4,338 * 2) * 0.08000 * 0.20625 = 525,818. This is actually less than the total number of pageviews in our two samples - 690,203, so it makes sense now why even with our samples we detected a significant difference of about 3%."
## [1] "Sign test for Retention rate"
##
## Exact binomial test
##
## data: successre and total
## number of successes = 13, number of trials = 23, p-value = 0.3388
## alternative hypothesis: true probability of success is greater than 0.5
## 95 percent confidence interval:
## 0.3753936 1.0000000
## sample estimates:
## probability of success
## 0.5652174
## [1] "But the sign test is not significant (p-value = 0.3388). Oh well, this is still a bonus as I did not expect to able to measure retention at all."
I did not use the Bonferroni correction because it only applies when we test multiple independent metrics and make our decision if at least one metric is significant. In our case the only metric I was interested in was the Gross Conversion rate. The effect-size hypothesis test and the sign test concurred (sign test p-value of 0.0013). The difference in GC between the experiment and control groups was both statistically and practically significant. I tested Net Conversion and Retention as well, but their results do not influence my recommendation.
Since the treatment succeeded in decreasing the Gross Conversion rate, I would recommend launching the change (the “screener”) in production. There is good evidence that the “screener” helped to exclude some of the students who would have canceled the trial. Additionally, the Net Conversion rate was not significantly impacted and Retention actually increased, suggesting that the “screener” did not deter committed students who became paying customers.
Company could conduct a wider and longer test like this one, or raise the pre-selection bar even further. For example, we could increase the required commitment from 5 to 10 hours, or introduce other barriers such as an assessment survey or a quiz that would discourage uncommitted students from enrolling in the trial.