Isn't cutting off at some confidence level on the 'do not do this' list for A/B testing? My understanding so far was that you test until you reach a predefined number of conversions, and then you check your significance to see whether the result you obtained is valid. Not that you test until you 'hit confidence'.
That's a blog post that I should get around to writing a rebuttal to some day, because it is widely quoted and off base.
In theoretical frequentist math world, it is correct. If you peek repeatedly, you'll eventually hit confidence even when there is no real difference. Back in the real world, it is perfectly acceptable to use a strategy like, "We'll set a really high confidence cut-off (e.g. 99.5%) until we get to a couple of thousand successes, and then we'll drop our standards substantially (e.g. a 95% cut-off). If we are forced to stop for business reasons, we'll choose whatever happens to be ahead at the moment."
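To make the theoretical problem concrete, here is a quick Monte Carlo sketch of my own (the traffic numbers and conversion rate are arbitrary): run an A/A test where both arms are identical, peek after every batch of visitors, and count how often you ever "hit confidence" anyway.

```python
# Monte Carlo sketch (my own illustration; parameters are arbitrary).
# Both arms have the SAME true conversion rate, yet repeated peeking
# declares significance far more often than the nominal 5%.
import math
import random

def z_stat(c_a, n_a, c_b, n_b):
    """Two-proportion z statistic; 0.0 when the pooled variance degenerates."""
    p = (c_a + c_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (c_a / n_a - c_b / n_b) / se

def ever_significant(rate, batches, batch_size, z_cut=1.96):
    """Peek after every batch; True if |z| ever crosses z_cut."""
    c_a = c_b = n = 0
    for _ in range(batches):
        for _ in range(batch_size):
            c_a += random.random() < rate
            c_b += random.random() < rate
        n += batch_size
        if abs(z_stat(c_a, n, c_b, n)) > z_cut:
            return True
    return False

random.seed(0)
runs = 400
hits = sum(ever_significant(0.05, batches=30, batch_size=100) for _ in range(runs))
print(f"fraction of A/A runs that ever 'hit confidence': {hits / runs:.1%}")
```

With 30 peeks per run, the realized false-positive rate lands well above the nominal 5%, which is exactly the effect the blog post warns about.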
And yes, I can use Bayesian statistics to demonstrate that following the strategy that I describe creates acceptably low probabilities of making somewhat wrong business decisions, while allowing you to make good business decisions more quickly. And in practice people can follow it without needing a strong statistical background. (If I did enough work I could come up with a sophisticated optimal curve to use in making decisions. But I have not done that work, and in practice explaining it would be more work than it is worth.)
Why is this? Two reasons.
The first is that you only really get "independent peeks" at different orders of magnitude of data. Thus if you wait until you're past a small amount of data, you don't get a strong "repeated looks" effect.
Secondly, coming to the wrong decision only matters to a business when the chosen option is substantially worse. If you follow a rule like the one I gave, your odds of accidentally making up your mind in the wrong direction when there is a business-significant difference are surprisingly low. For instance, if you would detect a 2% difference as significant and there is a real underlying difference of 1%, the odds that you're making the right decision by deciding right now at a 95% confidence level are 99.2%. And if the real difference is a 0.5% win, your odds of making the right decision right now are 91.5%. (This despite the fact that you'd expect to need 16x as much data to even have a good chance of detecting a 0.5% win!)
Thus the decisions that you're making are usually correct. And on the occasions where you make the wrong choice, the option you picked is usually not materially worse.
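For the curious, a back-of-envelope version of that calculation fits in a few lines. The model here is my own reconstruction (treat the z statistic as normal with unit variance, drifted in proportion to the true lift), so it lands close to, but not exactly on, the figures quoted above:

```python
# Conditional probability that a "significant" result points the right way.
# Assumption (mine): at the current sample size, a 2% observed lift sits
# exactly at the 95% two-sided cut-off (z = 1.96), and the z statistic is
# normal with unit variance around the true lift's drift.
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_right_direction(true_lift, detectable_lift, z_cut=1.96):
    """P(the significant arm is truly better | the result is significant)."""
    drift = z_cut * true_lift / detectable_lift  # mean of z under the true lift
    p_right = 1.0 - phi(z_cut - drift)   # crossed the correct boundary
    p_wrong = phi(-z_cut - drift)        # crossed the wrong boundary
    return p_right / (p_right + p_wrong)

print(f"real 1.0% lift: {p_right_direction(1.0, 2.0):.1%}")  # roughly 99%
print(f"real 0.5% lift: {p_right_direction(0.5, 2.0):.1%}")  # roughly 91%
```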
Great list Ben! In my experience of analyzing and commenting on the A/B test results of our (Visual Website Optimizer) customers, one of the most important effects I have observed is the newness effect of variations. Visitors are sometimes inclined to respond positively to a new variation just because it's a change from the plain, old, boring control. Even though the newness effect fades after a couple of days, it makes our customers prone to celebrating the success of their test early. That is why we always recommend calculating a preset number of days to run the test for before even looking at the results (we have a setting for that). And as a general rule of thumb, we ask our customers to wait at least 7 days before concluding anything from the results.
The newness effect can be a really confusing PITA. Another strategy for handling it in long-running tests is cohort analysis: analyze what people's actions were starting x days after they entered the test.
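In code, the cohort idea is just a filter on event timestamps. This is a minimal sketch of my own (the function name and data shape are made up): only count a visitor's conversions that happen at least x days after that visitor entered the test, so the initial novelty burst drops out of the comparison.

```python
# Toy cohort analysis (names and data shape invented for illustration):
# ignore conversions within the first `skip_days` after a visitor
# entered the test, so the newness effect doesn't inflate the numbers.
from datetime import date

def cohort_conversion_rate(visitors, skip_days=7):
    """visitors: list of dicts with 'entered' (date) and 'conversions'
    (list of dates). Returns conversions per visitor, counting only
    events at least skip_days after entry."""
    eligible = [
        sum(1 for d in v["conversions"] if (d - v["entered"]).days >= skip_days)
        for v in visitors
    ]
    return sum(eligible) / len(eligible) if eligible else 0.0

visitors = [
    {"entered": date(2013, 1, 1),
     "conversions": [date(2013, 1, 2), date(2013, 1, 20)]},  # day 1 and day 19
    {"entered": date(2013, 1, 5), "conversions": []},
]
print(cohort_conversion_rate(visitors))  # 0.5: only the day-19 event counts
```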
That said, if you turn it around, deliberately finding ways to use the newness effect to your advantage can be very rewarding.
As I suggested in http://bentilly.blogspot.com/2012/09/ab-testing-vs-mab-algor... you can use a MAB algorithm with a forgetting factor for this. I've actually done it with a more sophisticated system, but the idea is similar - boost whatever you have evidence is performing better right now. (You should put in logic to avoid time-based fluctuations in performance throwing you off.)
The one that I built had minor teething pains, but it is now working quite well. And it is convenient to be able to every so often drop in new versions, and remove ones which have proven unable to perform up to snuff even with a newness effect, and just trust that it will Do The Right Thing.
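For anyone curious what a forgetting factor looks like in practice, here is a toy version of the idea (my own sketch, not the system described above): epsilon-greedy over exponentially discounted counts, so old evidence decays and whatever performs better right now gets shown more.

```python
# Toy multi-armed bandit with a forgetting factor (illustrative sketch;
# a production system would also guard against time-of-day fluctuations).
import random

class ForgettingBandit:
    """Epsilon-greedy over exponentially discounted win/trial counts."""

    def __init__(self, arms, decay=0.999, epsilon=0.1):
        self.wins = {a: 0.0 for a in arms}
        self.trials = {a: 0.0 for a in arms}
        self.decay = decay
        self.epsilon = epsilon

    def _score(self, arm):
        # Discounted conversion rate; unseen arms get tried first.
        t = self.trials[arm]
        return self.wins[arm] / t if t else float("inf")

    def choose(self):
        if random.random() < self.epsilon:
            return random.choice(list(self.wins))
        return max(self.wins, key=self._score)

    def update(self, arm, converted):
        for a in self.wins:  # decay every arm, so stale evidence fades
            self.wins[a] *= self.decay
            self.trials[a] *= self.decay
        self.wins[arm] += 1.0 if converted else 0.0
        self.trials[arm] += 1.0

random.seed(1)
true_rates = {"control": 0.05, "variation": 0.15}  # invented numbers
bandit = ForgettingBandit(list(true_rates))
for _ in range(5000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < true_rates[arm])
print({a: round(t) for a, t in bandit.trials.items()})
```

The decay means the counts behave like a sliding window of roughly 1/(1 - decay) recent visitors, which is what lets a newness boost show up and later fade away.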
1) Be careful about what you're actually testing. More clicks or conversions is quite likely good, but over the long-term you want to track LTV. An effort that gets a lot of low-LTV customers will have great metrics at launch but will be disappointing down the road.
2) Consider segmenting your traffic on appropriate dimensions (campaign, referrer, previous behavior, platform, day of week, time of day), because it's common for a change to be amazingly effective (or ineffective), but only for one segment. Those opportunities can be lost if you treat your customer-base as a single segment.
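A toy illustration of point 2 (all numbers invented): in aggregate the variation looks flat, but segmenting by platform shows it losing on desktop and winning big on mobile.

```python
# Invented segment data: (conversions, visitors) per arm.
segments = {
    "desktop": {"control": (500, 10000), "variation": (460, 10000)},
    "mobile":  {"control": (100, 5000),  "variation": (150, 5000)},
}

def overall_rate(arm):
    """Conversion rate pooled across all segments."""
    conversions = sum(s[arm][0] for s in segments.values())
    visitors = sum(s[arm][1] for s in segments.values())
    return conversions / visitors

# Pooled, the two arms are nearly indistinguishable...
print(f"overall: control {overall_rate('control'):.2%}, "
      f"variation {overall_rate('variation'):.2%}")
# ...but per segment the story is very different.
for name, s in segments.items():
    (cc, cn), (vc, vn) = s["control"], s["variation"]
    print(f"{name}: control {cc / cn:.2%}, variation {vc / vn:.2%}")
```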
Good points, but those would be improvements for people who have more volume.
As I indicated, tests of revenue (including LTV) are significantly harder to do. Also, segmenting the data makes it more work to interpret, and means you need to collect more data to draw conclusions.
If you have the data, by all means do it. If not, then try not to be too unhappy that you can't do it yet.
I'd argue that if you have enough traffic to look for 1-2% improvements, you have enough traffic to do basic segmentation and to start measuring value.
That said, I agree with your greater point that you always need to make sure your tactics are aligned with your situation and strategy. For instance, if you're still seeking product/market fit, don't worry about micro-optimizations; just follow best practices for the big items and optimize them later.
See for instance: http://www.evanmiller.org/how-not-to-run-an-ab-test.html
Is there anybody here that can shed some light on this?