Might want to check the citations for more recent work ("Efficiency in sequential testing" looks relevant), and also the literature on bandit best-arm identification, which seems to be a distinct line of work but tackles broadly the same problem.
A/B tests, metric monitoring, health, and quality control all use this.
If you use LLMs, you might use this to determine, with fewer tokens, whether a model update or prompt change impacts results.
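
To make the LLM case concrete, here is a rough sketch, assuming "this" means something like Wald's sequential probability ratio test applied to pass/fail eval results. The function name `sprt_bernoulli` and the pass rates `p0` and `p1` are made up for illustration, not taken from any particular library or paper.

```python
import math


def sprt_bernoulli(outcomes, p0, p1, alpha=0.05, beta=0.05):
    """Sequential probability ratio test on a stream of pass/fail results.

    H0: pass rate is p0 (e.g. the old prompt's rate).
    H1: pass rate is p1 (e.g. a regression you care about detecting).
    alpha and beta are the target false-positive and false-negative rates.
    Returns (decision, number_of_samples_consumed).
    """
    upper = math.log((1 - beta) / alpha)   # cross this: accept H1
    lower = math.log(beta / (1 - alpha))   # cross this: accept H0
    llr = 0.0                              # running log-likelihood ratio
    n = 0
    for n, passed in enumerate(outcomes, start=1):
        if passed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", n
        if llr <= lower:
            return "accept_h0", n
    # Ran out of samples before either boundary was crossed.
    return "undecided", n
```

If `outcomes` is a generator that only calls the model when the next result is requested, the loop stops issuing calls as soon as a boundary is crossed, which is where the token savings would come from. For example, `sprt_bernoulli((run_eval(case) for case in cases), p0=0.9, p1=0.7)`, where `run_eval` and `cases` are placeholders for your own eval harness.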