Or do you rely on generic benchmarks?
You need custom QA pairs for custom scenarios.
I've got a few questions:
Could I ignore all flow runs that have an estimated cost above a certain threshold, so that the overall cost of optimization is less? Suppose I choose an acceptable level of accuracy and then skip some costlier exploration stuff. Is there a risk it doesn't find certain optimal configurations even under my cost limit?
How does the system deal with long context-length documents that some models can handle and others can't? Does this approach work for custom models?
Suppose I want to create and optimize for my own LLM-as-a-judge metrics like https://mastra.ai/en/docs/evals/textual-evals#available-metr..., how can I do this?
Are you going to flesh out the docs? Looks like the folder only has two markdown files right now.
Any recommendations for creating the initial QA dataset for benchmarking? Maybe creating a basic RAG system, using those search results and generations as the baseline and then having humans check and edit them to be more comprehensive and accurate. Any chance this is on the roadmap?
Cool stuff, I'm hoping this approach is more widely adopted!
>Could I ignore all flow runs that have an estimated cost above a certain threshold, so that the overall cost of optimization is less? Suppose I choose an acceptable level of accuracy and then skip some costlier exploration stuff. Is there a risk it doesn't find certain optimal configurations even under my cost limit?
See the Pareto Pruner section in the paper, where we evaluate the effect of ignoring such runs on the optimization. It shows that skipping these points does not lead to suboptimal configurations. At various points during development we have also implemented fixed cost / latency thresholds to reduce our cost exposure.
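If you also want a hard per-flow budget on top of that, a rough sketch of what skipping expensive candidates could look like in a multi-objective Optuna loop is below. This is not syftr's actual API; `estimate_flow_cost`, `evaluate_flow`, and the search space are hypothetical stand-ins.

    import optuna

    COST_THRESHOLD_USD = 0.05  # hypothetical per-query budget

    def estimate_flow_cost(params: dict) -> float:
        # Hypothetical cost model: price out expected prompt/completion tokens
        # for the chosen LLM, retrieved chunk count, reranker calls, etc.
        per_query_tokens = 500 + 300 * params["top_k"]
        price_per_1k = {"gpt-4o-mini": 0.00015, "llama-3-70b": 0.0009}[params["llm"]]
        return per_query_tokens / 1000 * price_per_1k

    def evaluate_flow(params: dict) -> tuple[float, float]:
        # Hypothetical: build the RAG flow, run it over the QA set, and
        # return (accuracy, measured cost per query). Dummy values here.
        return 0.5, estimate_flow_cost(params)

    def objective(trial: optuna.Trial):
        params = {
            "llm": trial.suggest_categorical("llm", ["gpt-4o-mini", "llama-3-70b"]),
            "top_k": trial.suggest_int("top_k", 2, 20),
        }
        if estimate_flow_cost(params) > COST_THRESHOLD_USD:
            raise optuna.TrialPruned()  # skip the run before spending anything
        return evaluate_flow(params)

    study = optuna.create_study(directions=["maximize", "minimize"])
    study.optimize(objective, n_trials=200)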
>How does the system deal with long context-length documents that some models can handle and others can't? Does this approach work for custom models?
That is part of why we jointly optimize the embedding and retriever settings. For models that only support shorter context lengths, the retrieved documents are truncated to avoid errors; if that leads to poor performance, the optimizer should move toward better settings (e.g. by being more selective with a reranker).
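To make that concrete, the truncation step is conceptually something like the sketch below. This is not syftr's actual code; token counting here uses tiktoken and the context sizes are illustrative.

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    def pack_context(chunks: list[str], max_context_tokens: int, reserved: int = 1024) -> str:
        # Keep the highest-ranked retrieved chunks until the model's context
        # window is nearly full, reserving room for the prompt and the answer.
        budget = max_context_tokens - reserved
        kept, used = [], 0
        for chunk in chunks:
            n = len(enc.encode(chunk))
            if used + n > budget:
                break
            kept.append(chunk)
            used += n
        return "\n\n".join(kept)

    retrieved_chunks = ["...chunk 1...", "...chunk 2...", "...chunk 3..."]
    short_ctx = pack_context(retrieved_chunks, max_context_tokens=8192)    # 8k-context model
    long_ctx = pack_context(retrieved_chunks, max_context_tokens=128000)   # 128k-context model

A flow that keeps losing context with a short-context model tends to score worse, and that accuracy signal is what pushes the search toward a reranker or a longer-context model.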
>Suppose I want to create and optimize for my own LLM-as-a-judge metrics.
Please reach out on our GitHub and we can discuss the code in more detail there. We are interested in making syftr more extensible, but we are curious exactly how people are interested in using it for more methods, especially whether you want to explore higher-dimensional objective spaces (accuracy + cost + additional metrics).
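For a sense of scope, a custom LLM-as-a-judge metric usually boils down to a small scoring function like the sketch below. This calls the OpenAI client directly; the prompt, the 0-1 scale, and how such a function would plug into syftr's objectives are all assumptions, not an existing integration.

    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading a RAG answer for faithfulness to the retrieved context.

    Context:
    {context}

    Question: {question}
    Answer: {answer}

    Reply with a single number between 0 and 1."""

    def faithfulness_score(question: str, answer: str, context: str,
                           model: str = "gpt-4o-mini") -> float:
        # Ask the judge model for a scalar score; treat unparseable output as 0.
        resp = client.chat.completions.create(
            model=model,
            temperature=0,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                context=context, question=question, answer=answer)}],
        )
        try:
            return max(0.0, min(1.0, float(resp.choices[0].message.content.strip())))
        except ValueError:
            return 0.0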
>Are you going to flesh out the docs?
We plan to. So far we have focused on writing the paper and explaining how syftr works; now we plan to pay more attention to the details of how to run experiments, customize flows, etc.
>Any recommendations for creating the initial QA dataset for benchmarking?
We used HuggingFace datasets as the dataset format. If there is interest in iterating on versions of the datasets on HuggingFace, that is something we could look at. You can see the current datasets here: https://huggingface.co/DataRobot-Research
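On bootstrapping your own set: the approach you describe (draft QA pairs from a basic RAG pass, then human review and correction) works well, and a small hand-curated benchmark in the HuggingFace `datasets` format is enough to start with. A minimal sketch, where the column names and repo id are placeholders rather than a required schema:

    from datasets import Dataset

    # A minimal hand-built QA benchmark; generate draft pairs however you like
    # (e.g. from a basic RAG pass), then have humans review and correct them.
    qa_pairs = [
        {"question": "What is the notice period in the 2023 vendor contract?",
         "answer": "90 days"},
        {"question": "Which team owns the billing service?",
         "answer": "The payments platform team"},
    ]

    ds = Dataset.from_list(qa_pairs)
    ds.save_to_disk("my_qa_benchmark")                                # keep a local copy
    # ds.push_to_hub("your-org/your-qa-benchmark", private=True)      # or share on the Hub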
...would it be accurate to say that syftr finds Pareto-optimal choices across cost, accuracy, and latency, where accuracy is decided by an LLM whose assessments are 90% correlated to that of human labelers?
Are there 3 objectives: cost, accuracy, and latency or 2: cost and accuracy?
Note that we used the Random LLM judge, which had a 0.84 correlation with human labeling, rather than the 0.90 correlation from only using gpt-4o-mini.