It reminds me very much of Langchain in that it feels like a rushed, unnecessary set of abstractions that add more friction than actual benefit, and ultimately boils down to an attempt to stake a claim as a major framework in the still very young stages of LLMs, as opposed to solving an actual problem.
And that's not really a given in reality. It allows all sorts of tricks to do what DSPy is aiming for, tricks that you won't be able to pull off in real life.
I could be sorely mistaken, but that's my take on the whole thing.
> as opposed to solving an actual problem
This was literally the point of the post. No one really knows what the future of LLMs will look like, so DSPy just iteratively optimizes your pipeline as best it can against your metric (your problem).
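To make that concrete, this is roughly what "optimize against your metric" looks like; a minimal sketch assuming a recent DSPy API, an OpenAI-hosted model, and a toy exact-match metric (the model name, data, and metric here are illustrative, not from the post):

```python
import dspy

# Configure whichever LM you use; this model name is just an example.
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A "signature" is a declared input -> output contract for one step.
qa = dspy.ChainOfThought("question -> answer")

# The metric encodes "your problem": any function that scores a prediction.
def exact_match(example, pred, trace=None):
    return example.answer.strip().lower() == pred.answer.strip().lower()

trainset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
]

# The optimizer rewrites prompts and picks few-shot demos to maximize the metric.
optimizer = dspy.teleprompt.BootstrapFewShot(metric=exact_match)
compiled_qa = optimizer.compile(qa, trainset=trainset)
print(compiled_qa(question="What is 3 + 3?").answer)
```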
> someone actually using for something other than a toy example
Among the problems I listed in the post, DSPy has some scalability problems too, but I don't want to take away from it: there are at least early signs of enterprise adoption in posts like this blog: https://www.databricks.com/blog/optimizing-databricks-llm-pi...
I think there might be practical benefits to it; the XMC example illustrates this for me:
You can try STORM (also from Stanford) and see the prompts it generates automatically; it tries to expand on your topic and simulate collaboration among several domain experts: https://github.com/stanford-oval/storm
An example article I asked it to generate: https://storm.genie.stanford.edu/article/how-the-number-of-o...
These libraries mostly exist as "cope" for the fact that we don't have good fine-tuning (e.g., LoRA) capabilities for ChatGPT et al., so we try to optimize the prompt instead.
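For contrast, this is the kind of LoRA fine-tune that isn't available for closed models but is a few lines against open weights; a minimal sketch using Hugging Face PEFT (the checkpoint and hyperparameters are assumptions for illustration):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any open-weights causal LM works; this checkpoint is just an example.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Low-rank adapters on the attention projections; only these are trained.
config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a tiny fraction of the base model's weights
```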
> nothing more than fancy prompt chains under the hood
Some approaches using steering vectors, clever ways of fine-tuning, transfer decoding, tree-search-style sampling, and others all seem very promising.
DSPy is, yes, ultimately a fancy prompt chain. Even once we integrate some of the other approaches, I don't think it becomes a single-lever problem where we can only change one thing (e.g., fine-tune a model) and that solves all of our problems.
It will likely always be a combination of the few most powerful levers to pull.
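As a rough illustration of the steering-vector idea mentioned above (not part of DSPy; the layer index and scale below are arbitrary assumptions): at inference time you nudge a hidden layer's activations along a fixed direction via a forward hook.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    """Return a forward hook that shifts a layer's output along `direction`."""
    def hook(module, inputs, output):
        # Many transformer blocks return tuples (hidden_states, ...).
        if isinstance(output, tuple):
            return (output[0] + alpha * direction,) + output[1:]
        return output + alpha * direction
    return hook

# Hypothetical usage: `direction` would come from contrasting activations on
# paired prompts; layer 15 is an arbitrary choice for illustration.
# handle = model.model.layers[15].register_forward_hook(make_steering_hook(direction))
```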
https://github.com/stanfordnlp/pyreft
Anything Christopher Manning touches turns to gold.
All I've seen are vague definitions of new terms (e.g., signatures) and "trust me, this is very powerful and will optimize it all for you".
Also, what would be a good way to decide between DSPy and TextGrad?
At least that's my understanding from reading the TextGrad paper recently.
* Multi-hop reasoning rarely works with real data in my case.
* Impossible to define advanced metrics over the whole dataset.
* No async support.
We need to stop doing useless reasoning stuff and find actual fitting problems for LLMs to solve.
Current LLMs are not your DB manager (if they could be, your DB isn't real-world sized). They are not a developer. We have people for that.
LLMs prove to be decent creative tools, classifiers, and Q&A answer generators.
Reasoning stuff is not useless. It provably (according to benchmarks) improves performance on coding and math-related tasks.