This is a useful result, but it is important to note that this is not necessarily what people have in mind when they think of "LLMs generating skills." Having the LLM write down a skill representing the lessons from the struggle you just had to get something done is more typical (I hope) and quite different from what they're referring to.
I'm sure news outlets and popular social media accounts will use appropriate caution in reporting this, and nobody will misunderstand it.
So when we look at the prompt they gave to have the agent generate its own skills:
> Important: Generate Skills First
>
> Before attempting to solve this task, please follow these steps:
>
> 1. Analyze the task requirements and identify what domain knowledge, APIs, or techniques are needed.
> 2. Write 1–5 modular skill documents that would help solve this task. Each skill should: focus on a specific tool, library, API, or technique; include installation/setup instructions if applicable; provide code examples and usage patterns; be reusable for similar tasks.
> 3. Save each skill as a markdown file in the environment/skills/ directory with a descriptive name.
> 4. Then solve the task using the skills you created as reference.
There's literally nothing it can do by way of "exploration" to populate and distill self-generated skills - not with a web search, not exploring an existing codebase for best practices and key files - only within its own hallucinations around the task description.
It also seems they're not even restarting the session after skills are generated, from that fourth bullet? So it's just regurgitating the context that was used to generate the skills.
So yeah, your empty-codebase vibe coding agent can't just "plan harder" and make itself better. But this is a misleading result for any other context, including the context where you ask for a second feature on that just-vibe-coded codebase with a fresh session.
If you gave this exact prompt to a senior engineer I would expect them to throw it back and ask wtf you actually want.
LLMs are not mind readers.
I think it's because AI Models have learned that we prefer answers that are confident sounding, and not to pester us with questions before giving us an answer.
That is, follow my prompt, and don't bother me about it.
Because if I am coming to an AI Agent to do something, it's because I'd rather be doing something else.
The headline is really bullshit, yes, I like the testing tho.
Even though my CLAUDE.md is small, my rules are often ignored. Not always, though, so it's still at least somewhat useful!
have a hook on switching out of plan, and maybe on edits, that passes the change to haiku with the claude.md to see if it matches or not
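A minimal sketch of the check such a hook could run, assuming the reviewer model is passed in as a plain callable. `build_review_prompt`, `check_change`, `ask_model`, and `fake_reviewer` are hypothetical names; the actual hook wiring and the real call to a small model are left out.

```python
from typing import Callable


def build_review_prompt(rules: str, diff: str) -> str:
    """Assemble the prompt a small reviewer model would see."""
    return (
        "You are checking a code change against project rules.\n"
        "Reply with PASS or FAIL on the first line, then a short reason.\n\n"
        f"## Rules\n{rules}\n\n## Change\n{diff}\n"
    )


def check_change(rules: str, diff: str, ask_model: Callable[[str], str]) -> bool:
    """Return True only if the reviewer's first line starts with PASS."""
    reply = ask_model(build_review_prompt(rules, diff))
    lines = reply.strip().splitlines()
    return bool(lines) and lines[0].strip().upper().startswith("PASS")


def fake_reviewer(prompt: str) -> str:
    """Stand-in for a real small-model call (assumption): flags print() by keyword."""
    change = prompt.split("## Change\n", 1)[1]
    return "FAIL\nuses print" if "print(" in change else "PASS\nlooks fine"


rules_text = "Never use print; use logging."
ok = check_change(rules_text, "+ logging.info('x')", fake_reviewer)
bad = check_change(rules_text, "+ print('x')", fake_reviewer)
```

Keeping the model behind a callable means the same check can be exercised with a fake reviewer before any real model is plugged in.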
Then I do it again from scratch; this time it takes less steering. I have it update the skill further.
I've been doing this on a few different tests and building a skill which is taking less and less steering to do app-specific and team-specific manual testing faster and faster. The first times through it took longer than manually testing the feature. While I've only started doing this recently, it is now taking less time than I would take, and posting screenshots of the results and testing steps in the PR for dev review. Ongoing exploration!
If everything you want an LLM to do is already captured as code or simple skills, you can switch to dumber models which know enough about selecting the appropriate skill for a given user input, and not much else. You would only have to tap into more expensive heavy duty LLMs when you are trying to do something that hasn’t been done before.
Naturally, AI companies with vested interest in making sure you use as many tokens as possible will do everything they can to steer you away from this type of architecture. It’s a cache for LLM reasoning.
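A rough sketch of that architecture, under stated assumptions: a naive keyword overlap stands in for the cheap model's skill selection, and the tier names `small-model` and `large-model`, the `Skill` class, and the sample `SKILLS` are all made up for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Skill:
    name: str
    keywords: FrozenSet[str]
    instructions: str


def pick_skill(request: str, skills: List[Skill]) -> Optional[Skill]:
    """Return the skill sharing the most words with the request, or None if no overlap."""
    words = set(request.lower().split())
    best: Optional[Skill] = None
    best_overlap = 0
    for skill in skills:
        overlap = len(words & skill.keywords)
        if overlap > best_overlap:
            best, best_overlap = skill, overlap
    return best


def route(request: str, skills: List[Skill]) -> Tuple[str, Optional[str]]:
    """Known work goes to the cheap tier with a skill; novel work goes to the expensive tier."""
    skill = pick_skill(request, skills)
    if skill is None:
        return ("large-model", None)
    return ("small-model", skill.name)


# Hypothetical skills, for illustration only.
SKILLS = [
    Skill("deploy", frozenset({"deploy", "release"}), "Steps for shipping a release."),
    Skill("invoice", frozenset({"invoice", "billing"}), "Steps for issuing an invoice."),
]

known = route("please deploy the release branch", SKILLS)
novel = route("design a new pricing strategy", SKILLS)
```

In a real setup the matching step would be the small model itself; the point is only that the expensive tier is reached when nothing captured so far applies.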
No, the actual incentive is that people will eventually benchmark their models on a bang-per-buck basis and models that chew through tokens are not going to be competitive. It's the same reason why the "Intel/AMD are intentionally sandbagging their CPUs so they can sell more CPUs" theory doesn't work.
At least currently in AI there is no moat so we wouldn't expect that to be occurring
You mean the dude who writes articles on TechCrunch and Ars Technica based off of HN and Reddit thread titles because he doesn't understand what real journalism is? Sure, we can count on him :)
Just as of last week I had Claude build me a skill when I ask it to help me troubleshoot issues, and it came out quite good.
It did have some issues (Claude tends to over-specify based on anecdotal data) but it's a strong step in the right direction.
Also, "skills" are too broad in my opinion. I have one (that Claude wrote) with my personal data that I have available when I analyze my workouts.
I think there's ample room for self-generated skills when you use a rather long exchange on a domain you plan to revisit, _especially_ when it comes to telling Claude what not to do.
I’m reading this paper as don’t do this. If you deploy agents to your workforce and tell them to use skills, don’t. Tell them to give it tasks. This sounds obvious but might not be to everyone. (And in any case, it’s nice for researchers to have confirmed pre-prompt skill writing doesn’t work. It would have been neat if it had.)
:D
> A common pitfall is for Claude to create skills and fill them up with generated information about how to complete a task. The problem with this is that the generated content is all content that's already inside Claude's probability space. Claude is effectively telling itself information that it already knows!
> Instead, Claude should strive to document in SKILL.md only information that:
> 1. Is outside of Claude's training data (information that Claude had to learn through research, experimentation, or experience)
> 2. Is context specific (something that Claude knows now, but won't know later after its context window is cleared)
> 3. Aligns future Claude with current Claude (information that will guide future Claude in acting how we want it to act)
> Claude should also avoid recording derived data. Lead a horse to water, don't teach it how to drink. If there's an easily available source that will tell Claude all it needs to know, point Claude at that source. If the information Claude needs can be trivially derived from information Claude already knows or has already been provided, don't provide the derived data.
For those interested the full skill is here: https://github.com/j-r-beckett/SpeedReader/blob/main/.claude...
Claude's training data is the internet. The internet is full of Express tutorials that use app.use(cors()) with no origin restriction. Stack Overflow answers that store JWTs in localStorage, etc.
Claude's probability space isn't a clean hierarchy of "best to worst." It's a weighted distribution shaped by frequency in training data.
So even though it "knows" stuff, it doesn't necessarily know what you want, or what a professional in a production environment would do.
Unless I'm missing something?
It's fairly common to see these types of threads where one thing is postulated and then there are comments upon comments of doers showing what they have done.
Web3 and JavaScript frameworks never had the nerd-sniping power of the AI ecosystem. I'm not denying the usefulness and potential of the space, and the achievements of its current champions, but the degree to which it has consumed discussion and productivity in the tech space is worrying.
This article would be wildly interesting with the opposite headline, but instead it simply states what many of us would assume based on experience.
that being said, I think you're right that all of this will be a moot point in like 2 weeks or 2 months, when the next AI model is released that addresses this specific friction
and yeah, that's sad. there are a lot of people in companies being instructed to pivot to skills, and then before they can launch or sell their procedurally generated moat, the next AI model will procedurally generate skills better
nobody knows what to do for guaranteed food and shelter so they're grasping
+4.5pp for software engineering is suspiciously low compared to +51.9pp for healthcare. I suspect this reflects that frontier models already have strong SWE priors from training data, so skills add less marginal value. If true, skills become most valuable precisely in the domains where models are weakest — which is where you'd actually want to deploy agents in production. That's encouraging.
This stood out for me as well. I do think that LLMs have a lot of training data on software engineering topics and that perhaps explains the large discrepancy. My experience has been that if I am working with a software library or tool that is very new or not commonly used, skills really shine there. Example: Adobe React Spectrum UI library. Without skills, Opus 4.6 produces utter garbage when trying to use this library. With properly curated/created skills, it shines. Massive difference.
If you have the idea and more or less the implementation plan, and let the LLM do the coding, you can end up with something maintainable and nice; it's basically up to you.
Strip away one layer, so you have the idea, but let the LLM come up with the implementation plan, then also the implementation, and things end up a lot less than ideal.
Remove another layer, let the LLM do it all, and it's all a mess.
I conjecture that after some years of LLMs reading a SharePoint site, producing summaries, then summaries of those summaries, etc... We will end up with a grotesque slurry.
At some point, fresh human input is needed to inject something meaningful into the process.
I have actually found something close to the opposite. I work on a large codebase and I often use the LLM to generate artifacts before performing the task (for complex tasks). I use a prompt to say "go explore this area of the code and write about it". It documents concepts and has pointers to specific code. Then a fresh session can use that without reading the stuff that doesn't matter. It uses more tokens overall, but includes important details that can get totally missed when you just let it go.
Or did you hyperfixate on the colloquial usage of zip
Reading this on HN... Sic transit gloria mundi!
What people have this misunderstanding?
That's not been my experience at all, what model and prompt would you use for that? Every single one I've tried is oblivious to if a design makes sense or not unless explicitly prompted for it with constraints, future ideas and so on.
What's the point of building skills like this?
When you create a skill for a particular model, you don't typically ask the model to create the skill based solely on its own latent knowledge. Otherwise, you'd expect the effect to be similar to telling the model 'make a plan before acting, make no mistakes'.
But that's what the paper's authors did!
When they say 'self-generated' they don't allow the model any tool access at all, not even web search.
It would be much more interesting if they had tested skills that were created in one of these ways:
A) The model interviews a human and then creates the skill, or
B) The model executes one or more deep research tasks in order to gather information, or
C) Some combo of the above.
This!
The only surprising part about the paper is that somebody wrote a paper on skills without a good understanding of the topic.
And also I’ve seen my manager LARP as an engineer by asking a model to generate a best practices doc for a service repo without supplying any additional context. So this sort of paper helps discourage that behavior.
This is like saying the CLAUDE.md or AGENTS.md is irrelevant because the LLM generated it.
Have there not been previous iterations of these tools where such techniques were actually effective?
(This also suggests that you should expect them to generally be bad at judging novel self-generated prompts/skills - if they could judge those, they would already be using them! There is a generator-verifier gap, but it is already exploited heavily during post-training and not much low-hanging fruit left there.)
I agree. (And it seems like it already stopped working, if I understood others here correctly.)
But again if I understood others here correctly, an academic paper like this would necessarily be studying models that are well behind the leading edge at time of publication. My argument is that the study authors shouldn't be faulted for investigating something that currently seems unlikely to work, because at the time of investigation it would have seemed much more likely to work.
I have seen some devs pull out absolutely bad guidance by introspecting the code with the LLM to define "best practices" and docs because it introduces its own encoded biases in there. The devs are so lazy that they can't be bothered to simply type the bullet points that define "good".
One example is that we had some extracted snippet for C#/.NET that was sprinkling in `ConfigureAwait(false)` which should not be in application code and generally not needed for ASP.NET. But the coding agent saw some code that looked like "library" code and decided to apply it and then someone ran the LLM against that and pulled out "best practices" and placed them into the repo and started to pollute the rest of the context.
I caught this when I found the code in a PR and then found the source and zeroed it out. We've also had to untangle some egregious use of `Task.Run` (again, not best practice in C# and you really want to know what you're doing with it).
At the end of it, we are building a new system that is meant to compose and serve curated, best practice guidance to coding agents to get better consistency and quality. The usage of self-generated skills and knowledge seems like those experiments where people feed in an image and ask the LLM to give back the image without changing it. After n cycles, it is invariably deeply mutated from the original.
Agentic coding is the future, but people have not yet adapted. We went from punch cards to assembly to FORTRAN to C to JavaScript, each step adding more abstractions. The next abstraction is Markdown, and I think that teams that invest their time in writing and curating markdown will create better guardrails within which agents can operate without sacrificing quality, security, performance, maintainability, and other non-functional aspects of a software system.
I don't completely disagree (I've argued the same point myself). But one critical difference between the LLM layer and all of those others you listed, is that LLMs are non-deterministic and all those other layers are. I'm not sure how that changes the dynamic, but surely it does.
So long as you supply the agent with a well-curated set of guidance, it should ultimately produce more consistent code with higher quality than if the same task were given to a team of random humans of varying skill and experience levels.
The key now is how much a team invests in writing the high quality guidance in the first place.
I've been building AI agent systems for clients and the pattern that works is iterative: the agent tries something, you steer it, then you capture what worked as a reusable skill. Not "generate skills before solving" but "distill lessons after solving." The paper tests the former, which nobody experienced actually does.
The real value of skills is reducing token burn on repeat tasks. Once you've figured out the right approach, you encode it so next time the model doesn't have to re-derive everything from first principles. It's memoization for reasoning.
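As a toy illustration of the memoization analogy, here is a sketch in which the approach for a kind of task is worked out once and then reused. `SkillMemo`, `derive_from_scratch`, and the step strings are hypothetical placeholders; in practice the stored value would be a skill file distilled from a steered session.

```python
from typing import Callable, Dict, List


class SkillMemo:
    """Remember the approach for a kind of task once it has been worked out."""

    def __init__(self, derive: Callable[[str], List[str]]) -> None:
        self._derive = derive                    # expensive: stands in for a long steered session
        self._skills: Dict[str, List[str]] = {}  # task kind -> captured steps
        self.derivations = 0                     # how many times the full cost was paid

    def approach_for(self, task_kind: str) -> List[str]:
        if task_kind not in self._skills:
            # Cache miss: pay the full cost once, then store the result as a "skill".
            self.derivations += 1
            self._skills[task_kind] = self._derive(task_kind)
        return self._skills[task_kind]


def derive_from_scratch(task_kind: str) -> List[str]:
    """Hypothetical placeholder for the steps a steered session ended up with."""
    return [f"open the {task_kind} checklist", "run the steps", "record the results"]


memo = SkillMemo(derive_from_scratch)
first = memo.approach_for("manual-testing")
second = memo.approach_for("manual-testing")  # served from the stored skill
```

The analogy is loose: a real skill is also edited over time, so it behaves more like a cache entry that gets refined than a pure memoized value.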
This is the correct way the vast majority of the time. There are exceptions: when I know for certain that the models do not have enough training material on a new library, one that isn't often used, or an internal tool. In those cases I know I will have a struggle on my hands if I don't start out with a skill that teaches the model the basics of what it does not know. I then update the skill with more polish as we discover additional ways it can be improved. Any errors the model makes are used to improve existing skills or create new ones.
But it seems pretty surprising to me. The training corpus contains so much information and the models operate at the level of… a bright novice. It seems like there obviously ought to be more insights to derive from looking harder at aspects of the corpus.
Why isn’t this considered astonishing?
1. only information and instructions on how to answer
2. some defined actions (run specific cli commands for specific tasks, use this api with those parameters)
3. skills including scripts
1 seems to be of limited use
2 and 3 can save the agent quite some time for finding a solution. And once the agent found a programmatic solution to a specific problem, they can store this information in a skill
I think that most of the adoption around Agent Skills would have a focus on ease of use, standardization and context management and not correctness.
My own thoughts on how to approach skill building target people who are adopting LLM development now more than ever, although this was definitely possible before (in a non-standard way) [1]
[1] https://alexhans.github.io/posts/series/evals/building-agent...
Chaos Congress talk on this from a couple months ago, jump to the coding loops part: https://media.ccc.de/v/39c3-breaking-bots-cheating-at-blue-t... . The talk focuses mostly on MCPs, but we now use the same flow for Skills.
This kind of experience makes me more hesitant to take on plugin and skill repos lacking evals or equivalent proving measurable quality over what the LLM knows and harness can handle. Generally a small number of things end up mattering majorly, but they end up being pivotal to get right, and the rest is a death by a thousand cuts.
For repetitive tasks, that could still be a good way to save on tokens and cost, while still remaining fully automated.
This was realized in 2023 already: https://newsletter.semianalysis.com/p/google-we-have-no-moat...
"Less is best" is not a new realization. The concept exists across contexts. Music described as "overplayed". Prose described as verbose.
We just went through an era of compute that chanted "break down your monoliths". NPM ecosystem being lots of small little packages to compose together. Unix philosophy of small composable utilities is another example.
So models will improve as they are compressed, skeletonized down to opcodes, geometric models to render, including geometry for text as the bytecode patterns for such will provide the simplest model for recreating the most outputs. Compressing out useless semantics from the state of the machines operations and leaving the user to apply labels at the presentation layer.
The "no moat" memo you linked was about open source catching up to closed models through fine-tuning, not about small models outperforming large ones.
I'm also not sure what "skeletonized down to opcodes" or "geometry for text as bytecode patterns" means in the context of neural networks. Model compression is a real field (quantization, distillation, pruning) but none of it works the way you're describing here.
Just asking a model "how good is this skill?" may or may not work, possibly the next laziest thing you could do - that's still "for cheap" - is asking the model to make a quiz for itself, and have it take the quiz with and without access to the skill, then see how the skill improved it. But there's still many problems with that approach. But would it be useful enough to work well enough much of the time for just heuristically estimating the quality of a skill?
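A minimal sketch of that with/without comparison, assuming exact-match grading and a model passed in as a callable. `score`, `skill_uplift`, `fake_answer`, and the `frobctl` command are hypothetical; exact-match grading is itself one of the weak points of the approach.

```python
from typing import Callable, Optional, Sequence, Tuple

Question = Tuple[str, str]                      # (prompt, expected answer)
Answerer = Callable[[str, Optional[str]], str]  # (prompt, skill text or None) -> answer


def score(quiz: Sequence[Question], answer: Answerer, skill: Optional[str]) -> float:
    """Fraction of quiz questions answered correctly (case-insensitive exact match)."""
    if not quiz:
        raise ValueError("quiz must contain at least one question")
    correct = sum(
        1
        for prompt, expected in quiz
        if answer(prompt, skill).strip().lower() == expected.strip().lower()
    )
    return correct / len(quiz)


def skill_uplift(quiz: Sequence[Question], answer: Answerer, skill: str) -> float:
    """Score with the skill in context minus score without it."""
    return score(quiz, answer, skill) - score(quiz, answer, None)


def fake_answer(prompt: str, skill: Optional[str]) -> str:
    """Stand-in for a model: knows arithmetic, needs the skill for the made-up tool."""
    if prompt == "What is 2 + 2?":
        return "4"
    if skill and "frobctl sync" in skill:
        return "frobctl sync"
    return "unknown"


QUIZ = [
    ("What is 2 + 2?", "4"),
    ("Which command syncs the cache?", "frobctl sync"),
]
uplift = skill_uplift(QUIZ, fake_answer, "To sync the cache, run `frobctl sync`.")
```

An uplift near zero would suggest the skill only restates what the model already knows; a self-written quiz can still miss whatever the model does not know it is missing.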
The derivative of an LLM agent's capabilities (on its own) is negative. It's not that they can't do useful work -- it means that (for now) they require some level of input or steering.
If that were to change -- if an agent could consistently get better at what it does without intervention -- that would represent a true paradigm shift. An accelerating curve, rather than one trending back towards linearity.
This represents a necessary inflection point for any sort of AI "takeoff" scenario.
So this study is actually kind of important, even though it's a null result. Because the contra view would be immensely significant.
Others here have suggested that AIs should be able to self-generate skills by doing web searches. What happens when all of the information from web searches (of knowledge generated by ordinary human intelligence) has been extracted?
On another post (about crackpot Nick Bostrom claiming that an ASI would "imminently" lead to scientific breakthroughs like curing Alzheimers and so a 3% chance of developing an ASI would be worth a 97% chance of annihilating humanity) I noted that an ASI isn't a genie or magic wand; it can't find the greatest prime or solve the halting problem. Another person noted that an ASI can't figure out how to do a linear search in O(1) time. (We already know how to do a table lookup in amortized O(1) time--build a hash table.) Science is like animal breeding and many other processes ... there's a limit to how much it can be sped up.
FWIW I didn't read the paper and am judging it based on its title, which I think is fair because "self-generated agent skills" is a pretty loose definition.
Not even sure how you envision continuous learning, but if you mean model updates, I'm not sure the economics work out
What AIs get is a cheat sheet for the session
What you are suggesting is a very expensive late-training phase activity. It's also not clear anymore when fine-tuning helps or hurts. Progress is rapid
I mean, basically it's doing the same thing as reasoning IIUC, except up-front rather than inline and ad-hoc, so I'd almost expect it to work even better than reasoning alone.
OTOH, for something I innately know how to do, like long division, writing down the algorithm doesn't help at all. In fact if someone just gave me that algorithm and for whatever reason I didn't recognize what it was, I'd have a lot harder time following the instructions than just innately dividing the numbers.
Of course anthropomorphizing is always dangerous, but it does provide potential reasons why my above rationale could be wrong.
However, I've found them to be useful for capturing instructions on how to use other tools (e.g. hints on how to use command-line tools or APIs). I treat them like mini CLAUDE.mds that are specific only to certain workflows.
When Claude isn't able to use a Skill well, I ask it to reflect on why, and update the Skill to clarify, adding or removing detail as necessary.
With these Skills in place, the agent is able to do things it would really struggle with otherwise, having to consume a lot of tokens failing to use the tools and looking up documentation, etc.
We are probably undervaluing the human part of the feedback loop in this discussion. Claude is able to solve the problem given the appropriate human feedback — many then jump to the conclusion that well, if Claude is capable of doing it under some circumstances, we just need to figure out how to remove the human part so that Claude can eventually figure it out itself.
Humans are still serving a very crucial role in disambiguation, and in centering the most salient information. We do this based on our situational context, which comes from hands-on knowledge of the problem space. I'm hesitant to assume that because Claude CAN bootstrap skills (which is damn impressive!), it would somehow eventually do so entirely on its own, devoid of any situational context beyond a natural language spec.
I imagine something more like https://github.com/ryanthedev/code-foundations
Based on an actual software book.
1. You MUST review and correct them
2. Embrace minimalism, they are spark notes and an index, not comprehensive
3. Force them into context
I imagine similar concepts hold for skills
Also, generating skills using a top-of-the-line model and then using them later in a cheap open-weights model seems like a good use of resources.
Online sharing of skills generated in such manner also seems like a wonderful idea.
I have systemized and automated businesses for a long time before LLMs came out, which generally wasn't very popular.
It is really weird to see everyone get excited about this kind of automation and then try to jump to the end points with something that's non-deterministic and wonder why it doesn't work like every other computer they've used (all or none).
Agents can self generate skills, maybe not effortlessly, or with psychic skills of reading between the lines (special exception for Claude), it's also about the framework and scaffolding in which to create skills that work, and what can be brought back to the "self-generation".
Without experience in creating computer skills in general, attempting to self-generate agent skills is kind of like using AI to autocomplete a sentence and then not liking how it went. To a fair degree it can be lined up to improve considerably.
Right now there seems to be a 6-12 month lag between studies like these and it being shared/reported in the wild.
Too often, they are researching something reported in the wild and trying to study it, and it very well may work for some cases, but not all cases, and the research kind of entirely misses it.
With AI, it's incredibly important to follow show and not tell.
Sharing this from genuine curiosity if this resonates with anyone, and if so, how/where.
> Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something.