> In fact, while the new flow worked great on mobile, conversion was lower on desktop – an insight we missed when we combined these metrics.
> This phenomenon is known as Simpson's paradox – i.e. when experiments show one outcome when analyzed at an aggregated level, but a different one when analyzed by subgroups.
There's nothing strange about finding out that some groups benefit and others lose out when dividing up your data. You're looking at an average and some parts are positive and others are negative. Where's the paradox there?
Simpson's paradox is when, in the aggregate, more button presses lead to more purchases, but then you look at desktop vs mobile and find that for both desktop and mobile more clicks don't mean more purchases (or worse, more clicks mean fewer purchases).
That's why it's a paradox. The association between two variables exists at the aggregate level but doesn't exist or is backwards when you split up the population. It's not a statement about the average performance of something.
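A tiny made-up illustration of that reversal (the click/purchase numbers below are invented purely to show the effect, not taken from any real experiment):

```python
# Within each device the association between clicks and purchases is negative,
# yet pooled together it looks strongly positive. Numbers are hypothetical.
import numpy as np

mobile_clicks,  mobile_purchases  = np.array([1, 2, 3, 4]),     np.array([4, 3, 2, 1])
desktop_clicks, desktop_purchases = np.array([11, 12, 13, 14]), np.array([14, 13, 12, 11])

all_clicks    = np.concatenate([mobile_clicks, desktop_clicks])
all_purchases = np.concatenate([mobile_purchases, desktop_purchases])

print("mobile r:    ", np.corrcoef(mobile_clicks, mobile_purchases)[0, 1])    # -1.0
print("desktop r:   ", np.corrcoef(desktop_clicks, desktop_purchases)[0, 1])  # -1.0
print("aggregate r: ", np.corrcoef(all_clicks, all_purchases)[0, 1])          # ~ +0.90
```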
I would add a 7th A/B testing mistake to that list: not learning about basic probability, statistical tests, power, etc. Flying by the seat of your pants when statistics are involved always ends badly.
How could more button presses lead to increased conversion rates in the aggregate while that relationship disappears when comparing desktop and mobile? Wouldn't you see at least one device type demonstrating a higher CVR to reflect the aggregate CVR increase?
You can take data where, as a whole, presses lead to more purchases, then split it into two halves (like mobile vs desktop) and show that on both halves presses lead to fewer purchases.
The whole paradox is that the intuition we have for averages doesn't apply to correlations.
I suggest checking out the Wikipedia page.
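And to answer the "wouldn't at least one device type show a higher CVR?" part concretely: not necessarily. Here's a sketch with invented counts where both device-level conversion rates drop yet the pooled rate rises, simply because the mix behind the pooled number shifts toward the higher-converting device (in practice that shift tends to come from a self-selected denominator or unbalanced assignment, not from clean randomization):

```python
# Hypothetical counts, purely illustrative: both device-level conversion rates
# drop, but the pooled rate rises because the traffic mix shifts toward desktop,
# which converts better to begin with.
control   = {"mobile": (100, 1000), "desktop": (300, 1000)}   # (conversions, users)
treatment = {"mobile": (36, 400),   "desktop": (448, 1600)}

for name, arm in [("control", control), ("treatment", treatment)]:
    for device, (conv, users) in arm.items():
        print(f"{name:9s} {device:7s} CVR = {conv / users:.1%}")
    total_conv  = sum(c for c, _ in arm.values())
    total_users = sum(u for _, u in arm.values())
    print(f"{name:9s} overall CVR = {total_conv / total_users:.1%}")
    # control: 10.0% / 30.0%, overall 20.0%
    # treatment: 9.0% / 28.0%, overall 24.2%
```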
This is where most tests fail, in my experience.
Everyone wants to run A/B tests because that’s what the big co’s are doing and they want to look like the sort of person BigCo might hire, but they’re making silly mistakes because stats is hard and not taught well at school.
Sure, conversion would go up if it was in the first slot versus the last, but it takes effort (however little) to scroll through a carousel, so ensuring that we can measure the quality of the result and not just the quantity is really important.
This is one way I've tried to avoid the problem you describe. It's not enough that people can engage with the feature, but they need to engage with it meaningfully and in such a way that would encourage repeat behavior.
Another example of what you describe is that on our site if you search for "neon blue guitar", don't interact with the search results whatsoever, go back to the home page, click on a product on a carousel, and purchase _that_ product, it counts as a "successful search event," even though the search technically failed because they didn't interact with it in any meaningful way. To your point: PM is happy; user is not.
TL;DR it's really important to think through tests and how you measure success!
Depends, does that increase overall sales? Or is it ‘better’ to make the customer ‘walk past’ the other items to get to the thing they want (the way supermarkets make you walk up the back of the shop to get to the milk), and maybe buy something else too?
You might end up looking at lots of different slices of your data, and you might come to the conclusion, "Oh, it looks like France is statistically significantly negative on our new signup flow changes."
It's important to make sure you have a hypothesis for a given slice before you start the experiment and not just hunt for outliers after the fact; otherwise you're just p-hacking [1].
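As a rough sketch of why pre-registering the slice matters (simulated data with no real effect anywhere; the 20 "countries" and the 5% base rate are just assumptions):

```python
# There is no real effect in any slice here, yet with 20 country slices you
# can expect a "significant" result or so at p < 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_slices, users_per_arm = 20, 2000
false_positives = 0

for country in range(n_slices):
    # Both arms draw conversions from the same 5% rate: a true null effect.
    a = rng.binomial(1, 0.05, users_per_arm)
    b = rng.binomial(1, 0.05, users_per_arm)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_slices} slices look 'significant' with no real effect")
```

At p < 0.05 across ~20 slices you should expect roughly one spurious hit per experiment even when nothing changed at all.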
Segmenting and data dredging are fine provided you run a new test with fresh data to validate whether there is a causal relationship behind any correlations you found.
The temptation to "peek" and keep on peeking until the test confesses to the thing you want it to say is very high.
If you peek, you really have to commit to running the test for the full duration no matter what.
It's more complicated, but you can also run sequential A/B tests using [SPRT](https://en.wikipedia.org/wiki/Sequential_probability_ratio_t...) or similar, where a test gets accepted or rejected once it hits a threshold. I won't go into the details, but you can incrementally calculate the test statistic, so if your test is performing very badly or very well, it will end early.
One product team I worked in ran all tests as sequential tests. If you build a framework around this, I'd argue it's easier for statistics-unaware stakeholders to understand when you _can_ end a test early.
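For the curious, a minimal sketch of the textbook one-sample Wald SPRT for a conversion rate; this isn't any particular team's framework, and the rates, error levels, and simulated visitor stream below are placeholder assumptions:

```python
# Minimal Wald SPRT sketch: test H0: p = p0 vs H1: p = p1 for a conversion
# rate, updating the log-likelihood ratio one observation at a time and
# stopping as soon as it crosses either boundary.
import math
import random

def sprt(observations, p0=0.05, p1=0.06, alpha=0.05, beta=0.20):
    """Return ('H0' or 'H1', samples used), or ('undecided', n) if data runs out."""
    upper = math.log((1 - beta) / alpha)   # cross this -> accept H1
    lower = math.log(beta / (1 - alpha))   # cross this -> accept H0
    llr, n = 0.0, 0
    for n, converted in enumerate(observations, start=1):
        # Incremental update of the test statistic, one visitor at a time.
        llr += math.log(p1 / p0) if converted else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "H1", n   # early stop: evidence for the higher rate
        if llr <= lower:
            return "H0", n   # early stop: evidence for the baseline rate
    return "undecided", n

random.seed(1)
stream = (random.random() < 0.07 for _ in range(200_000))  # simulated visitors
print(sprt(stream))  # stops as soon as a boundary is crossed
```

The appeal is exactly what the parent describes: the statistic is updated incrementally, so a clearly good or clearly bad variant ends the test early without anyone having to eyeball a dashboard and "decide" to stop.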
But what else is there to measure, other than checkout button click count (and follow-up purchases), to capture the effect of a button color change?
Or perhaps this is not a robust example to illustrate underspecification?
Purchases occur on the checkout page itself. Its design, payment input, and upsells can all impact results, potentially counteracting the button color's effects. You need a clearer hypothesis to address these.
My impression is that companies like to add/test a lot of features separately - and individually these features are good, but together they form complex clutter and end up being a net negative.