Betting on DSPy for Systems of LLMs
83 points
28 days ago
| 6 comments
| blog.isaacmiller.dev
| HN
bart_spoon
28 days ago
[-]
The more I’ve looked at DSPy, the less impressed I am. The design of the project is very confusing, with nonsensical, convoluted abstractions. And for all the discussion surrounding it, I’ve yet to see someone actually using it for something other than a toy example. I’m not sure I’ve even seen someone prove it can do what it claims to in terms of prompt optimization.

It reminds me very much of Langchain in that it feels like a rushed, unnecessary set of abstractions that add more friction than actual benefit, and ultimately boils down to an attempt to stake a claim as a major framework in the still very young stages of LLMs, as opposed to solving an actual problem.

reply
isoprophlex
28 days ago
[-]
The magic sauce seems to be, at every turn, "... if you have some well defined metric to optimize on."

And that's not really a given in reality. A well-defined metric enables all sorts of tricks to achieve what DSPy is aiming for, tricks you often won't be able to pull off in real life.

Unless I'm sorely mistaken, that's my take on the whole thing.
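To make that concrete, here's a toy sketch (my own illustration, not DSPy code): a crisp metric like exact match is trivial to write, but the open-ended tasks most people actually ship don't admit one.

```python
def exact_match(prediction: str, gold: str) -> float:
    # The easy case: extraction/classification tasks where a
    # well-defined metric exists and an optimizer can climb it.
    return float(prediction.strip().lower() == gold.strip().lower())


def quality_of_open_ended_answer(prediction: str) -> float:
    # The common real-world case: no ground truth, so there is
    # nothing crisp to hand an optimizer. You'd need human labels
    # or an LLM judge, both noisy and costly.
    raise NotImplementedError("no well-defined metric exists")
```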

reply
isaacbmiller
28 days ago
[-]
Disclaimer: original blog author

> as opposed to solving an actual problem

This was literally the point of the post. No one really knows what the future of LLMs will look like, so DSPy just iteratively changes your program in the best way it can for your metric (i.e., your problem).

> someone actually using for something other than a toy example

DSPy does have some scalability problems, among the others I listed in the post, and I won't take away from that. But there are at least early signs of enterprise adoption, e.g. this blog post: https://www.databricks.com/blog/optimizing-databricks-llm-pi...

reply
curious_cat_163
28 days ago
[-]
The abstractions could be cleaner. I think some of the convolution is due to the evolution the project has undergone; the core contributors have not fully come around to being “out with the old”.

I think there might be practical benefits to it. The XMC example illustrates it for me:

https://github.com/KarelDO/xmc.dspy

reply
isaacfung
27 days ago
[-]
This repo has some less trivial examples: https://github.com/ganarajpr/awesome-dspy

You can try STORM (also from Stanford) and see the prompts it generates automatically; it tries to expand on your topic and simulate collaboration among several domain experts: https://github.com/stanford-oval/storm

An example article I asked it to generate https://storm.genie.stanford.edu/article/how-the-number-of-o...

reply
Der_Einzige
28 days ago
[-]
Agreed 100%. DSPy and the libraries inspired by it (e.g. https://github.com/zou-group/textgrad) are nothing more than fancy prompt chains under the hood.

These libraries mostly exist as "cope" for the fact that we don't have good fine-tuning (e.g. LoRA) capabilities for ChatGPT et al., so we try to optimize the prompt instead.

reply
qeternity
28 days ago
[-]
Glad to see others saying this. I haven't looked at it in some months, but I previously concluded it's mostly a very complicated way to optimize few-shot prompts. It's hardly the magical black-box optimizer they try to market it as.
reply
dmarchand90
28 days ago
[-]
My guess is it will be like Pascal or Smalltalk: an important development for illustrating a concept, but ultimately replaced by something more rigorous.
reply
isaacbmiller
28 days ago
[-]
> These libraries mostly exist as "cope"

> nothing more than fancy prompt chains under the hood

Some approaches using steering vectors, clever ways of fine-tuning, transfer decoding, some tree search sampling-esque approaches, and others all seem very promising.

DSPy is, yes, ultimately a fancy prompt chain. But even once we integrate some of the other approaches, I don't think it becomes a single-lever problem where we can change only one thing (e.g., fine-tune a model) and that solves all of our problems.

It will likely always be a combination of the few most powerful levers to pull.

reply
Der_Einzige
28 days ago
[-]
Correct. When I say "ChatGPT et al.", I mean closed-source, paywalled LLMs; open-access LLM personalization is an extreme game-changer. All of what you mentioned is important, and I'm particularly excited about PyReft.

https://github.com/stanfordnlp/pyreft

Anything Christopher Manning touches turns to gold.

reply
okigan
28 days ago
[-]
Could we have a concise and specific explanation of how DSPy works?

All I've seen are vague definitions of new terms (e.g. signatures) and "trust me, this is very powerful and will optimize it all for you".

Also, what would be a good way to decide between DSPy and TextGrad?

reply
curious_cat_163
28 days ago
[-]
My understanding is that it tries many variations of the set of few-shot examples and prompts, and picks the ones that work best as the optimized program.
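As a hypothetical sketch (my own illustration, not DSPy's actual internals), that search amounts to: enumerate (prompt, few-shot set) candidates, score each on a dev set with the user's metric, and keep the winner.

```python
import itertools


def optimize(prompts, fewshot_sets, run, metric, devset):
    """Grid-search over (prompt, few-shot set) pairs; return the pair
    with the highest average metric score on the dev set.

    run(prompt, shots, example) -> prediction   (would call the LLM)
    metric(prediction, example) -> float
    """
    best, best_score = None, float("-inf")
    for prompt, shots in itertools.product(prompts, fewshot_sets):
        score = sum(metric(run(prompt, shots, ex), ex)
                    for ex in devset) / len(devset)
        if score > best_score:
            best, best_score = (prompt, shots), score
    return best, best_score
```

Real optimizers are smarter about which candidates to try, but the shape is the same: no metric, no search.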
reply
ktrnka
27 days ago
[-]
TextGrad mainly optimizes the prompt but does not inject few-shot examples. DSPy mainly optimizes the few-shot examples.

At least that's my understanding from reading the textgrad paper recently.

reply
thatsadude
28 days ago
[-]
I had a few problems with DSPy:

* Multi-hop reasoning rarely works with real data in my case.
* Impossible to define advanced metrics over the whole dataset.
* No async support

reply
gunalx
28 days ago
[-]
Not to say anything about DSPy, but I really liked the take on what we should use LLMs for.

We need to stop doing useless reasoning stuff and find problems that actually fit what LLMs can solve.

Current LLMs are not your DB manager (if they could be, your DB isn't real-world sized). They are not a developer. We have people for that.

LLMs prove to be decent creative tools, classifiers, and Q&A answer generators.

reply
isaacfung
27 days ago
[-]
We have translators. Doesn't mean we can't replace them with a cheaper, more accessible tool. That's the whole point of automation.

Reasoning stuff is not useless. It provably (according to benchmarks) improves performance on coding- and math-related tasks.

reply
revskill
28 days ago
[-]
Whenever I see "ChainOfThought" for AI, it's an annoying and misleading term. Machines never think at all.
reply