Research-Driven Agents: What Happens When Your Agent Reads Before It Codes
53 points | 2 hours ago | 11 comments | blog.skypilot.co | HN
maCDzP
25 seconds ago
I have an ML project. I usually set up a team of agents with a leader, archivist, research assistant, researcher, developer and tester. The team generates hypotheses based on papers, tests them, and iterates on the results. Everything is documented in a lab notebook. It burns tokens, but I have found some promising strategies that I am now testing.
reply
simlevesque
31 minutes ago
I've been making skills from arXiv papers for a while. I have one for multi-object tracking, for example. It has a SKILL.md describing all the important papers (over 30) on the subject and a folder with each paper's full content as reStructuredText.

To feed arXiv papers to LLMs I found that RST gives the best token count/fidelity ratio. Markdown lacks precision. LaTeX is too verbose. I have a script with the papers' URLs, names and dates that downloads the LaTeX zips from arXiv, extracts them, converts them to RST and then adds them to the right folder. Then I ask an LLM to write a summary from the full text, then I give other LLMs the full paper again with the summary and ask them to improve and proofread it. While this goes on I read the papers myself, and at the end I read the summaries and add the ones I approve to the skill. For each paper I also add info on how well the algorithms it describes do on common benchmarks.

I highly recommend doing something similar if you're working in a cutting-edge domain. Also I'd like to know if anyone has recommendations to improve what I do.
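
A minimal Python sketch of the download-and-convert part of the pipeline described above, not the commenter's actual script. It assumes that arXiv serves sources at /e-print/<id> as a (possibly gzipped) tar archive, that pandoc is installed and on PATH, and that the entry point is the first .tex file declaring a document class. All function names and the folder layout are hypothetical.

```python
import io
import subprocess
import tarfile
import urllib.request
from pathlib import Path


def eprint_url(arxiv_id: str) -> str:
    # Assumption: arXiv serves paper sources at /e-print/<id>.
    return f"https://arxiv.org/e-print/{arxiv_id}"


def download(arxiv_id: str) -> bytes:
    with urllib.request.urlopen(eprint_url(arxiv_id), timeout=60) as resp:
        return resp.read()


def extract_tex(archive: bytes, dest: Path) -> list[Path]:
    """Unpack only the .tex files of a (possibly gzipped) tarball into dest."""
    root = dest.resolve()
    root.mkdir(parents=True, exist_ok=True)
    written = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:*") as tar:
        for member in tar.getmembers():
            target = (root / member.name).resolve()
            if not member.isfile() or target.suffix != ".tex":
                continue
            if root not in target.parents:
                continue  # ignore entries that would land outside dest
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(tar.extractfile(member).read())
            written.append(target)
    return sorted(written)


def find_main(tex_files: list[Path]) -> Path:
    """Guess the entry point: the first file that declares a document class."""
    for path in tex_files:
        if "\\documentclass" in path.read_text(errors="ignore"):
            return path
    raise FileNotFoundError("no .tex file with \\documentclass found")


def pandoc_cmd(main_tex: Path, out_rst: Path) -> list[str]:
    return ["pandoc", str(main_tex), "--from=latex", "--to=rst", "-o", str(out_rst)]


def to_rst(arxiv_id: str, papers_dir: Path) -> Path:
    work = (papers_dir / arxiv_id).resolve()
    main_tex = find_main(extract_tex(download(arxiv_id), work / "src"))
    out_rst = work / "paper.rst"
    # Run from the source directory so relative \input paths can be found.
    subprocess.run(pandoc_cmd(main_tex, out_rst), cwd=main_tex.parent, check=True)
    return out_rst
```

Calling to_rst with a paper ID and a target folder would leave a paper.rst next to the extracted sources. Papers that ship only a PDF or a single gzipped .tex file would need separate handling, and the summarize and proofread passes are left out on purpose.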

reply
ctoth
5 minutes ago
I've been working on ctoth/research-papers-plugin, the pipeline to actually get LLMs to extract the notes. I really like your insight re RST over Markdown! It sounds like we're working on similar stuff and I'll absolutely reach out :)
reply
paulluuk
14 minutes ago
This sounds like it would work, but honestly if you've already read all 30 papers fully, what do you still need the LLM to do for you? Just the boilerplate?
reply
simlevesque
7 minutes ago
I'm trying to make a Go library that implements a wide range of MOT algorithms and can gather metrics for all of them.

Reading all the papers once isn't the same as this. I find it very useful.

I can ask an LLM to do the basic implementations, then I can refine them (make the code better, faster, cut down on memory use), then I can ask the LLM if I'm still implementing the algorithms as they're described in the papers.

reply
alex000kim
23 minutes ago
sounds similar to "LLM Knowledge Bases" https://xcancel.com/karpathy/status/2039805659525644595
reply
MrLeap
15 minutes ago
What is RST?
reply
simlevesque
12 minutes ago
reply
ctoth
7 minutes ago
I've been very interested in this recently. I'm pretty sure that every project should have a ./papers directory of annotated papers in it like I do in Qlatt[0].

Literally every project. If it's something that's been done a million times, then that means it has good literature on it. If not, then it's even more important to find related stuff! And not just crunchy CS stuff like databases or compilers or whatever. Are you creating a UI? There's probably been great UI research you can build on! Will this game loop be fun in the game you're building? There's probably been research about it!

[0]: https://github.com/ctoth/Qlatt/blob/master/papers/

reply
zzleeper
2 minutes ago
Wow this is amazing. Did you write all those MD files by hand, or did you use an LLM for the simple stuff like extracting abstracts?
reply
alex000kim
3 minutes ago
That directory is huge already! I guess the index.md helps the agent find what it needs, but even the markdown file is very long - this would consume a ton of tokens.

Also I wonder who/what decides what papers go in there.

In the blog post, the agent is allowed to do its own search.

reply
dataviz1000
1 hour ago
(Sorry to spam.)

I'm working on this also from a different angle. Hopefully sharing adds to the conversation.

First, about the loop: Claude's (coding agent) context and attention are big enough to self-reflect. Agent Tuning shows a technique that not only demonstrates this but also gives a way to quantify it. [0] The difference is that autoresearch's val_bpb measures what the agent built; Agent Tuning's p̂ measures the agent itself.

> Claude's attention doesn't distinguish between "instructions I'm writing" and "instructions I'm following" -- they're both just tokens in context.

Second, doing research and finding academic papers to add to context helps. Here is an example of an implementation that creates trading strategies by reading research and recreating them in creative new ways. [1]

The biggest problem is that coding agents don't "fail fast and loud". They fail deceptively.

[0] https://github.com/adam-s/agent-tuning

[1] https://github.com/adam-s/alphadidactic
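
One way to push an agent loop toward failing fast and loud, as a hedged Python sketch that is not taken from either linked repository. The runner raises on a non-zero exit code, and it raises when the expected metric line is missing instead of falling back to a default. StepFailed, run_step and parse_metric are hypothetical names, and the "name: value" output format is an assumption.

```python
import re
import subprocess


class StepFailed(RuntimeError):
    """Raised whenever a step cannot prove that it succeeded."""


def run_step(cmd: list[str], timeout: float = 600.0) -> str:
    """Run a command and return stdout; raise on any non-zero exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise StepFailed(f"{cmd!r} exited with {proc.returncode}:\n{proc.stderr.strip()}")
    return proc.stdout


def parse_metric(output: str, name: str) -> float:
    """Read a 'name: value' line; a missing metric is an error, never a default."""
    pattern = rf"^{re.escape(name)}\s*[:=]\s*([-+0-9.eE]+)\s*$"
    match = re.search(pattern, output, re.MULTILINE)
    if match is None:
        raise StepFailed(f"metric {name!r} not found in output")
    return float(match.group(1))
```

The point of the design is that a silent zero, an empty string, or a swallowed exception never reaches the agent as if it were a result.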

reply
KingOfCoders
1 hour ago
I use #PPPCDC for prompting: plan, plan, plan, then verify with: Compare the plan to the existing Code. Reread and compare the plan to the Docs. Fix the areas you're not Confident about.
reply
austinbaggio
27 minutes ago
The research step makes sense. I can also confirm that running multiple agents with diverse strategies compounds results more quickly than running a single agent.
reply
alex000kim
20 minutes ago
I am sure this would work well in general. There is a challenge wrt how to make them communicate effectively, e.g. to 1) avoid duplicative work and 2) allow them to combine/overlay each other's findings to yield even better results.
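
One possible shape for that coordination, sketched in Python under loud assumptions: nothing here comes from the blog post or the thread. Agents claim a hypothesis in a shared ledger before working on it and record the measured gain afterwards. Ledger, claim, record and confirmed are hypothetical names. The duplicate check only normalizes case and whitespace, so paraphrased duplicates slip through, and the in-memory dict would need a lock or a database once agents run in separate processes.

```python
from dataclasses import dataclass, field


def normalize(hypothesis: str) -> str:
    """Case- and whitespace-insensitive key; paraphrases will NOT collide."""
    return " ".join(hypothesis.lower().split())


@dataclass
class Ledger:
    claims: dict[str, str] = field(default_factory=dict)      # key -> agent name
    findings: dict[str, float] = field(default_factory=dict)  # key -> measured gain

    def claim(self, agent: str, hypothesis: str) -> bool:
        """Return True only for the first agent to pick up a hypothesis."""
        key = normalize(hypothesis)
        if key in self.claims:
            return False
        self.claims[key] = agent
        return True

    def record(self, hypothesis: str, gain: float) -> None:
        self.findings[normalize(hypothesis)] = gain

    def confirmed(self, min_gain: float = 0.0) -> list[str]:
        """Hypotheses whose gain exceeds min_gain, best first."""
        keep = [key for key, gain in self.findings.items() if gain > min_gain]
        return sorted(keep, key=self.findings.get, reverse=True)
```

An orchestrator could then hand the output of confirmed() to a single agent and ask it to try the winning changes together.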
reply
hungryhobbit
55 minutes ago
I think anyone who uses Claude knows that it works smarter when you have it make a plan first, and ask it to research the existing code as much as possible first ... so the results in this article don't surprise me at all.

However, I'd be curious to hear back from others who have tried adding the shell script (at the end of the article) to their flow: does it (really) improve Claude?

reply
hopechong
2 hours ago
Coding agents that read papers before writing code find optimizations that code-only agents miss.

We added a literature review phase to Karpathy’s autoresearch loop and pointed it at llama.cpp. The agent autonomously read arxiv papers, studied competing forks and spun up VMs to run parallel experiments.

reply
outside1234
20 minutes ago
A research step (gather insights from across the codebase and internet on how to accomplish the next step), a planning step (how should I sequence implementation given that research), an implementation step, and a verification step (code review of the implementation) make a super effective workflow for me.
reply
alex000kim
17 minutes ago
yup, as the blog says

> The full setup works with any project that has a benchmark and test suite.

so having a clear and measurable verification step is key. Meaning you can't simply give an AI agent a vague goal e.g. "improve the quality of the codebase" because it's too general.
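
To make "clear and measurable" concrete, here is a small Python sketch of an acceptance gate. It is an illustration under stated assumptions, not the blog's setup: Result, accept, the tokens_per_second metric and the 2% threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    tests_passed: bool
    tokens_per_second: float  # hypothetical benchmark metric, higher is better


def accept(baseline: Result, candidate: Result, min_gain: float = 0.02) -> bool:
    """Keep a change only if the test suite passes and the benchmark improves."""
    if not candidate.tests_passed:
        return False  # a faster but broken build is never an improvement
    return candidate.tokens_per_second >= baseline.tokens_per_second * (1 + min_gain)
```

A goal like "improve the quality of the codebase" gives the agent nothing to plug into a check like this, which is why it stays too vague to drive the loop.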

reply
phendrenad2
1 hour ago
This is obvious, right? If you want to build a Facebook clone, you wouldn't tell the agent "build Facebook". You would provide it with a description of every page on Facebook, behaviors, interactions, UI, etc.
reply
faeyanpiraat
47 minutes ago
Have you even read the TL;DR in the linked article??
reply
phendrenad2
40 minutes ago
You mean this part?

> TL;DR: Coding agents generate better optimizations when they read papers and study competing projects before touching code

What made you think I hadn't read the article, let alone that TL;DR? I'm really curious. Jumping to an insulting "have you read the article" is a big step, so it'll be really interesting to see where your mind went.

reply
doctorpangloss
50 minutes ago
The skypilot devs need to focus on decoupling their offering, so that their very valuable "find the cheapest cloud" functionality isn't married to a glitchy reinvention of Kubernetes JobSet and MLflow
reply