Or do you rely on generic benchmarks?
You need custom QA pairs for custom scenarios.
I've got a few questions:
Could I ignore all flow runs that have an estimated cost above a certain threshold, so that the overall cost of optimization is less? Suppose I choose an acceptable level of accuracy and then skip some costlier exploration stuff. Is there a risk it doesn't find certain optimal configurations even under my cost limit?
How does the system deal with long context-length documents that some models can handle and others can't? Does this approach work for custom models?
Suppose I want to create and optimize for my own LLM-as-a-judge metrics like https://mastra.ai/en/docs/evals/textual-evals#available-metr..., how can I do this?
Are you going to flesh out the docs? Looks like the folder only has two markdown files right now.
Any recommendations for creating the initial QA dataset for benchmarking? Maybe creating a basic RAG system, using those search results and generations as the baseline and then having humans check and edit them to be more comprehensive and accurate. Any chance this is on the roadmap?
Cool stuff, I'm hoping this approach is more widely adopted!
>Could I ignore all flow runs that have an estimated cost above a certain threshold, so that the overall cost of optimization is less? Suppose I choose an acceptable level of accuracy and then skip some costlier exploration stuff. Is there a risk it doesn't find certain optimal configurations even under my cost limit?
See the Pareto Pruner section in the paper, where we evaluate the effect of ignoring such runs on the optimization. It shows that skipping these points does not lead to suboptimal configurations. At various points during development we have also implemented fixed cost / latency thresholds to reduce our cost exposure.
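If you also want a hard per-flow budget on top of that, a rough sketch of what skipping expensive candidates could look like in a multi-objective Optuna loop is below. This is not syftr's actual API; `estimate_flow_cost`, `evaluate_flow`, and the search space are hypothetical stand-ins.

    import optuna

    COST_THRESHOLD_USD = 0.05  # hypothetical per-query budget

    def estimate_flow_cost(params: dict) -> float:
        # Hypothetical cost model: price out expected prompt/completion tokens
        # for the chosen LLM, retrieved chunk count, reranker calls, etc.
        per_query_tokens = 500 + 300 * params["top_k"]
        price_per_1k = {"gpt-4o-mini": 0.00015, "llama-3-70b": 0.0009}[params["llm"]]
        return per_query_tokens / 1000 * price_per_1k

    def evaluate_flow(params: dict) -> tuple[float, float]:
        # Hypothetical: build the RAG flow, run it over the QA set, and
        # return (accuracy, measured cost per query). Dummy values here.
        return 0.5, estimate_flow_cost(params)

    def objective(trial: optuna.Trial):
        params = {
            "llm": trial.suggest_categorical("llm", ["gpt-4o-mini", "llama-3-70b"]),
            "top_k": trial.suggest_int("top_k", 2, 20),
        }
        if estimate_flow_cost(params) > COST_THRESHOLD_USD:
            raise optuna.TrialPruned()  # skip the run before spending anything
        return evaluate_flow(params)

    study = optuna.create_study(directions=["maximize", "minimize"])
    study.optimize(objective, n_trials=200)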
>How does the system deal with long context-length documents that some models can handle and others can't? Does this approach work for custom models?
That is part of why we jointly optimize the embedding and retriever settings. For models that only support shorter context lengths, the retrieved documents are truncated to avoid errors; if that leads to poor performance, the optimizer should move toward better settings (e.g. by being more selective with a reranker).
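To make that concrete, the truncation step is conceptually something like the sketch below. This is not syftr's actual code; token counting here uses tiktoken and the context sizes are illustrative.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def pack_context(chunks: list[str], max_context_tokens: int, reserved: int = 1024) -> str:
        # Keep the highest-ranked retrieved chunks until the model's context
        # window is nearly full, reserving room for the prompt and the answer.
        budget = max_context_tokens - reserved
        kept, used = [], 0
        for chunk in chunks:
            n = len(enc.encode(chunk))
            if used + n > budget:
                break
            kept.append(chunk)
            used += n
        return "\n\n".join(kept)

    retrieved_chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
    short_ctx = pack_context(retrieved_chunks, max_context_tokens=8192)    # 8k-context model
    long_ctx = pack_context(retrieved_chunks, max_context_tokens=128000)   # 128k-context model

A flow that keeps losing context with a short-context model tends to score worse, and that accuracy signal is what pushes the search toward a reranker or a longer-context model.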
>Suppose I want to create and optimize for my own LLM-as-a-judge metrics.
Please reach out on our GitHub and we can discuss the code in more detail there. We are interested in making syftr more extensible, but we are curious exactly how people are interested in using it for more methods, especially whether you want to explore higher-dimensional objective spaces (accuracy + cost + additional metrics).
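For a sense of scope, a custom LLM-as-a-judge metric usually boils down to a small scoring function like the sketch below. This calls the OpenAI client directly; the prompt, the 0-1 scale, and how such a function would plug into syftr's objectives are all assumptions, not an existing integration.

    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading a RAG answer for faithfulness to the retrieved context.

    Context:
    {context}

    Question: {question}
    Answer: {answer}

    Reply with a single number between 0 and 1."""

    def faithfulness_score(question: str, answer: str, context: str,
                           model: str = "gpt-4o-mini") -> float:
        # Ask the judge model for a scalar score; treat unparseable output as 0.
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                context=context, question=question, answer=answer)}],
        )
        try:
            return max(0.0, min(1.0, float(resp.choices[0].message.content.strip())))
        except ValueError:
            return 0.0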
>Are you going to flesh out the docs?
We plan to. So far we have focused on writing the paper and explaining how syftr works; now we plan to pay more attention to the details of how to run experiments, customize flows, etc.
>Any recommendations for creating the initial QA dataset for benchmarking?
We used HuggingFace datasets as the dataset format. If there is interest in iterating on versions of the datasets on HuggingFace, that is something we could look at. You can see the current datasets here: https://huggingface.co/DataRobot-Research
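On bootstrapping your own set: the approach you describe (draft QA pairs from a basic RAG pass, then human review and correction) works well, and a small hand-curated benchmark in the HuggingFace `datasets` format is enough to start with. A minimal sketch, where the column names and repo id are placeholders rather than a required schema:

    from datasets import Dataset

    # A minimal hand-built QA benchmark; generate draft pairs however you like
    # (e.g. from a basic RAG pass), then have humans review and correct them.
    qa_pairs = [
        {"question": "What is the notice period in the 2023 vendor contract?",
         "answer": "90 days"},
        {"question": "Which team owns the billing service?",
         "answer": "The payments platform team"},
    ]

    ds = Dataset.from_list(qa_pairs)
    ds.save_to_disk("my_qa_benchmark")                                # keep a local copy
    # ds.push_to_hub("your-org/your-qa-benchmark", private=True)      # or share on the Hub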
...would it be accurate to say that syftr finds Pareto-optimal choices across cost, accuracy, and latency, where accuracy is decided by an LLM whose assessments are 90% correlated to that of human labelers?
Are there 3 objectives: cost, accuracy, and latency or 2: cost and accuracy?
Note that we used the Random LLM judge, which had a 0.84 correlation with human labeling, rather than the 0.90 correlation from only using gpt-4o-mini.