In that situation, your model's accuracy will look good on holdout sets but underperform in users' hands.
Anyone know of other similar tools that let you track across harnesses while coding?
Running evals as a solo dev is too cost-prohibitive, I think.
This project takes a somewhat unconventional approach, but that might surface issues that are masked in typical benchmark datasets.