I wish this were a metric in the AI benchmarks so I could choose a model based on it, because honestly it's one of the things I care most about.
Problem: how can you measure such things? What are the metrics?
...maybe there just isn't a way to do it, since that metric isn't in the charts.
Other than benchmarks, I'd say the answer is your own test suite.
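
For what it's worth, a personal test suite can be very small. Here's a rough sketch, not a real eval framework: `fake_model` is a hypothetical stand-in for whatever model API you'd actually call, and the two cases are made up just to show the shape. Each case pairs a prompt with a check on the reply, and the score is simply the fraction of checks that pass.

```python
from typing import Callable, List, Tuple

# Each case is (prompt, check), where check decides whether a reply is acceptable.
Case = Tuple[str, Callable[[str], bool]]


def pass_rate(model: Callable[[str], str], cases: List[Case]) -> float:
    """Return the fraction of cases whose reply passes its check."""
    if not cases:
        return 0.0
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)


# Hypothetical stand-in for a real model call (assumption, not a real API).
def fake_model(prompt: str) -> str:
    return "4" if "2 + 2" in prompt else "I don't know"


# Made-up example cases; replace them with prompts that reflect what you care about.
cases: List[Case] = [
    ("What is 2 + 2?", lambda reply: "4" in reply),
    ("What is the capital of France?", lambda reply: "Paris" in reply),
]

score = pass_rate(fake_model, cases)
print(f"pass rate: {score:.2f}")
```

Swap in a function that calls each model you're comparing, keep the cases fixed, and you get your own number per model, even if it's not in anyone's charts.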