A/B testing mistakes I learned the hard way
110 points | 1 month ago | 6 comments | newsletter.posthog.com | HN
light_hue_1
28 days ago
[-]
That's not Simpson's paradox!

> In fact, while the new flow worked great on mobile, conversion was lower on desktop – an insight we missed when we combined these metrics.

> This phenomenon is known as Simpson's paradox – i.e. when experiments show one outcome when analyzed at an aggregated level, but a different one when analyzed by subgroups.

There's nothing strange about finding out that some groups benefit and others lose out when dividing up your data. You're looking at an average, and some parts are positive while others are negative. Where's the paradox there?

Simpson's paradox is when more button presses lead to more purchases. But then you look at desktop vs mobile and you find out that for both desktop and mobile more clicks doesn't mean more purchases (or worse, more clicks means fewer purchases).

That's why it's a paradox. The association between two variables exists at the aggregate level but doesn't exist or is backwards when you split up the population. It's not a statement about the average performance of something.

I would add a 7th A/B testing mistake to that list: not learning about basic probability, statistical tests, power, etc. Flying by the seat of your pants when statistics are involved always ends badly.
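
For example, a back-of-the-envelope power calculation before you launch (the numbers here are made up, and this assumes statsmodels is available):

    # Assumes statsmodels; baseline and target rates are invented for illustration.
    from statsmodels.stats.power import NormalIndPower
    from statsmodels.stats.proportion import proportion_effectsize

    effect = proportion_effectsize(0.06, 0.05)   # Cohen's h for a 5% -> 6% conversion lift
    n_per_arm = NormalIndPower().solve_power(
        effect_size=effect, alpha=0.05, power=0.8, alternative='two-sided')
    print(round(n_per_arm))   # roughly 4,100 users per variant

Even a rough number like that tells you whether the test is worth running at all.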

reply
gregbarbosa
20 days ago
[-]
> Simpson's paradox is when more button presses lead to more purchases. But then you look at desktop vs mobile and you find out that for both desktop and mobile more clicks doesn't mean more purchases (or worse, more clicks means fewer purchases).

How could more button presses lead to a higher conversion rate in aggregate while that relationship disappears when you compare desktop and mobile? Wouldn't at least one device type have to show a higher CVR to produce the aggregate increase?

reply
light_hue_1
9 days ago
[-]
That's Simpson's paradox!

You can take data where, as a whole, more presses lead to more purchases. Then split it into two halves (like mobile vs desktop) and show that in both halves more presses lead to fewer purchases.

The whole paradox is that the intuition we have for averages doesn't apply to correlations.
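
Here's a toy version you can run to see the reversal (numbers completely made up):

    import numpy as np

    # Within each device, more clicks go with fewer purchases...
    mobile_clicks,  mobile_buys  = np.array([1, 2, 3, 4]),     np.array([4, 3, 2, 1])
    desktop_clicks, desktop_buys = np.array([11, 12, 13, 14]), np.array([14, 13, 12, 11])
    print(np.corrcoef(mobile_clicks,  mobile_buys)[0, 1])    # -1.0
    print(np.corrcoef(desktop_clicks, desktop_buys)[0, 1])   # -1.0

    # ...but pooled together the correlation flips positive, because desktop
    # users click more *and* buy more overall.
    clicks = np.concatenate([mobile_clicks, desktop_clicks])
    buys   = np.concatenate([mobile_buys,  desktop_buys])
    print(np.corrcoef(clicks, buys)[0, 1])                   # ~ +0.90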

I suggest checking out the Wikipedia page.

reply
iamacyborg
28 days ago
[-]
> I would add a 7th A/B testing mistake to that list: not learning about basic probability, statistical tests, power, etc. Flying by the seat of your pants when statistics are involved always ends badly.

This is where most tests fail, in my experience.

Everyone wants to run A/B tests because that’s what the big co’s are doing and they want to look like the sort of person BigCo might hire, but they’re making silly mistakes because stats is hard and not taught well at school.

reply
eterm
28 days ago
[-]
Yep, I came here to say the same thing. The author has misremembered / misrepresented Simpson's paradox, which is much stronger than an aggregate hiding a group effect.
reply
a2128
28 days ago
[-]
I feel like I've too often seen products add new (anti)features that are way too easy to click accidentally. Whenever I do click one by accident, I imagine it incrementing some statistics counter that ultimately shows the product managers super high engagement, clearly meaning the users must love it since they're using it all the time.
reply
Jemaclus
28 days ago
[-]
To your point, my company is running some A/B tests and I insisted that we not just measure conversion ("it works") but also measure metrics that indicate it works _well_. For example, if you have a carousel of products and someone buys something from the carousel, then it "works," but it works _better_ if the item they bought was the first thing on the carousel rather than the last. That indicates we showed relevant products first, which is better than showing the relevant product last!

Sure, conversion would go up if it was in the first slot versus the last, but it takes effort (however little) to scroll through a carousel, so ensuring that we can measure the quality of the result and not just the quantity is really important.

This is one way I've tried to avoid the problem you describe. It's not enough that people can engage with the feature, but they need to engage with it meaningfully and in such a way that would encourage repeat behavior.

Another example of what you describe: on our site, if you search for "neon blue guitar", don't interact with the search results whatsoever, go back to the home page, click on a product in a carousel, and purchase _that_ product, it counts as a "successful search event," even though the search technically failed because you didn't interact with it in any meaningful way. To your point: PM is happy; user is not.

TL;DR it's really important to think through tests and how you measure success!

reply
taneq
28 days ago
[-]
> but it would work _better_ if the item they bought was the first thing on the carousel rather than the last

Depends, does that increase overall sales? Or is it ‘better’ to make the customer ‘walk past’ the other items to get to the thing they want (the way supermarkets make you walk up the back of the shop to get to the milk), and maybe buy something else too?

reply
Jemaclus
28 days ago
[-]
Absolutely! It's important to have a hypothesis and test it, to OP's point!
reply
thekoma
28 days ago
[-]
One thing I always click accidentally is the Google result that animates open when I return to the search results after visiting one of the result pages, while trying to quickly open the next result.
reply
texuf
28 days ago
[-]
Then in the code there's a bug that over- or under-reports those clicks (because UI is not procedural code that lends itself to straightforward metrics), and I think this could explain Spotify's product decisions.
reply
ffhhj
28 days ago
[-]
Several years ago Google implemented this A/B feature in which you had to choose one image or another. Do you remember that one? Of course I always chose the wrong one ;) It didn't last long.
reply
clarle
28 days ago
[-]
#2 is a slippery slope if you don't do it properly.

You might end up looking at lots of different slices of your data and come to a conclusion like, "Oh, it looks like France is statistically significantly negative on our new signup flow changes."

It's important to make sure you have a hypothesis for the given slice before you start the experiment, and not just hunt for outliers after the fact; otherwise you're just p-hacking [1].
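
To put a rough number on it (illustrative only): slice by 20 countries and test each at alpha = 0.05 with no correction, and the odds of at least one slice looking "significant" by pure chance are already high:

    alpha, n_slices = 0.05, 20
    # Chance of at least one false positive across independent slices
    print(1 - (1 - alpha) ** n_slices)   # ~0.64

    # A blunt fix is a Bonferroni correction: only flag a slice whose
    # p-value clears alpha / n_slices
    print(alpha / n_slices)              # 0.0025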

[1]: https://en.wikipedia.org/wiki/p-hacking

reply
Dyac
28 days ago
[-]
Fundamentally, you can't use the same data to both generate and validate/disprove a hypothesis.

Segmenting and data dredging are fine provided you run a new test with fresh data to validate whether there is a causal relationship behind any correlations you find.

reply
lnenad
28 days ago
[-]
I agree. As per the example in point number one, if your goal was to increase conversions, you were successful. You can then go to the next step, slice the data up, and iterate on another change. If you fall into the trap of over-analyzing, you'll probably find all sorts of irrelevant patterns.
reply
sokoloff
28 days ago
[-]
I recall getting into a heated debate with an analyst at my company over the topic of "peeking" (he was right; I was wrong, but it took me several days to finally understand what he was saying.)

The temptation to "peek" and keep on peeking until the test confesses to the thing you want it to say is very high.

reply
jakevoytko
28 days ago
[-]
This is the most "damned if you do, damned if you don't" part of testing. I've found so many coding errors that weren't obvious until you looked at the day 2 or day 3 test results. "Hm, that's weird. Why is $thing happening in this test? It shouldn't even touch that component."

If you peek, you really have to commit to running the test for the full duration no matter what.

reply
thaumasiotes
28 days ago
[-]
No you don't. If your protocol involves peeking (and early stopping), you need different thresholds to declare statistical significance. But you can do that. You just need to know whether you're peeking or not, which everybody does.
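
A quick simulation (my own sketch) shows why the naive thresholds don't survive peeking: stop at the first p < 0.05 under a true null effect, and the realized false-positive rate lands far above 5%.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, looks, batch = 2000, 10, 200
    false_positives = 0

    for _ in range(n_sims):
        # Both arms convert at 5%, i.e. there is truly no effect.
        a = rng.binomial(1, 0.05, looks * batch)
        b = rng.binomial(1, 0.05, looks * batch)
        for k in range(1, looks + 1):
            _, p = stats.ttest_ind(a[:k * batch], b[:k * batch])
            if p < 0.05:          # naive peek: declare a winner and stop
                false_positives += 1
                break

    print(false_positives / n_sims)   # well above the nominal 0.05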
reply
beejiu
28 days ago
[-]
> If you peek, you really have to commit to running the test for the full duration no matter what.

It's more complicated, but you can also run sequential A/B tests using [SPRT](https://en.wikipedia.org/wiki/Sequential_probability_ratio_t...) or similar, where a test gets accepted or rejected once it hits a threshold. I won't go into the details, but you can incrementally calculate the test statistic, so if your test is performing very badly or very well, it will end early.

One product team I worked in ran all tests as sequential tests. If you build a framework around this, I'd argue it's easier for statistics-unaware stakeholders to understand when you _can_ end a test early.
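
For anyone curious, a bare-bones version of the idea looks roughly like this (my own sketch, not any particular team's framework; it tests H0: 5% conversion against H1: 6%):

    import math, random

    p0, p1 = 0.05, 0.06            # H0 and H1 conversion rates
    alpha, beta = 0.05, 0.20       # target error rates
    upper = math.log((1 - beta) / alpha)   # cross this -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross this -> accept H0

    random.seed(1)
    llr = 0.0
    for n in range(1, 200_000):
        converted = random.random() < 0.06   # pretend the true rate is 6%
        llr += math.log(p1 / p0) if converted else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            print(f"accept H1 after {n} users")
            break
        if llr <= lower:
            print(f"accept H0 after {n} users")
            break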

reply
aflag
28 days ago
[-]
If there is a bug, then the experiment needs to be called off and a new one constructed. You shouldn't change anything else during the execution of the experiment.
reply
iamcreasy
28 days ago
[-]
The article says 'Changing the color of the "Proceed to checkout" button will increase purchases.' is a bad hypothesis because it is underspecified.

But what else is there to measure, other than checkout button clicks (and follow-up purchases), to capture the effect of the button color change?

Or perhaps this is not a robust example to illustrate underspecification?

reply
gregbarbosa
20 days ago
[-]
> But what else is there to measure, other than checkout button clicks (and follow-up purchases), to capture the effect of the button color change?

Purchases occur on the checkout page itself. Its design, payment input, and upsells can all impact results, potentially counteracting the button color's effects. You need a clearer hypothesis to address these.

reply
KTibow
28 days ago
[-]
The number of people who start checkout and the number of people who complete checkout are different, and I think that's what they meant.
reply
ImageXav
28 days ago
[-]
Here's another one that I feel is often overlooked by traditional A/B testers: if you have multiple changes, don't simply test them independently. Learn about fractional factorial experiments and interactions, and design your experiment accordingly. You'll get a much more relevant result.

My impression is that companies like to add/test a lot of features separately - and individually these features are good, but together they form complex clutter and end up being a net negative.
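
As a sketch (feature names invented for the example), a 2^(3-1) fractional factorial covers three features in four cells instead of eight by aliasing the third factor with the interaction of the first two:

    from itertools import product

    # Four cells instead of eight: the defining relation is C = A*B, so each
    # main effect is confounded with one two-factor interaction (A with BC,
    # B with AC, C with AB).
    runs = []
    for a, b in product([-1, +1], repeat=2):
        runs.append({"new_nav": a, "new_search": b, "new_checkout": a * b})

    for run in runs:
        print(run)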

reply