I am struck by how much these kinds of context documents resemble normal developer documentation, except actually useful and task-oriented. What was the barrier to creating these documents before?
Three theories on why this is so different:
1) The feedback loop was too long. If you wrote some docs, you might never learn if they were any good. If you did, it might be years later. And if you changed them, doing an A/B test was impractical. Now, you can write up a context markdown, ask Claude to do something, and iterate in minutes.
2) The tools can help build them. Building good docs was always hard. Especially if you take the time to include examples, urls, etc. that make the documentation truly useful. These tools reduce this cost.
3) Many programmers are egotists. Documentation that helps other people doesn't generate internal motivation. But documentation that allows you to better harness a computer minion to your will is attractive.
Any other theories?
If you are a developer who is not writing the documents for consumption by AI, you are primarily writing documents for someone who is not you; you do not know what this person will need or if they will ever even look at them.
They may, of course, help you, but you may not realize that, or have the time or discipline.
If you are writing them because the AI using them will help you, you have a very strong and immediate incentive to document the necessary information. You also have the benefit of a short feedback loop.
Side note: thanks to LLMs' penchant for wiping out comments, I have a lot more docs these days and far fewer comments.
But the real problem with docs is that for MOST use cases, the audience and context of the readers matter HUGELY. Most docs are bad because we can't predict those. People waste ridiculous amounts of time writing docs that nobody reads or nobody needs, based on hypotheses about the future that turn out to be false.
And _that_ is completely different when you're writing context-window documents. These aren't really documents describing any codebase, or the context within which the codebase exists, in some timeless fashion; they're better understood as part of a _current_ plan for action on an acute, real concern. They're battle-tested the way docs only rarely are. And as a bonus, sure, they're retainable and might help for the next problem too, but that's not why they work; they work because they're useful in an almost testable way right away.
The exceptions to this pattern kind of prove the rule - people for years have done better at documenting isolatable dependencies, i.e. libraries - precisely because those happen to sit at boundaries where it's both easier to make decent predictions about future usage, and often also because those docs might have far larger readership, so it's more worth it to take the risk of having an incorrect hypothesis about the future wasting effort - the cost/benefit is skewed towards the benefit by sheer numbers and the kind of code it is.
Having said that, the dust hasn't settled on the best way to distill context like this. It'd be a mistake to overanalyze the current situation and conclude that documentation is certain to be the long-term answer - it's definitely helpful now, but it's certainly conceivable that more automated and structured representations might emerge, or forms better suited for machine consumption that look a little more alien to us than conventional docs.
They are useful to the LLM in writing the code (which comes after).
But when it comes to an LLM reading that code later, it's just a waste of context.
For humans it's a waste of screen space.
A comment should only explain what the following thing does if it's hard to parse for some reason.
Otherwise it should add information: why something is as it is (i.e. some special case), add breadcrumbs to other bits of the code, etc.
I wish these coding agents had a post step to remove any LLMish comments they added during writing, and I want linters that flag these.
There's a piece of common knowledge that NBA basketball players could all hit over 90% on free throws if they shot underhand (granny style). But for pride reasons, they don't shoot underhand. Shaq shot just 52%, even though it'd be free points if he could easily shoot better.
I suspect there's similar things in software engineering. I've seen plenty of comments on HN about "adding code comments like a junior software engineer" or similar sentiment. Sure, there's legitimate gripes about comments (like how they can be misleading if you update the code without changing the comment, etc), but I strongly suspect they increase comprehension of code overall.
Personally, I remove redundant comments AI adds specifically to demonstrate that I have reviewed the code and believe that the AI's description is accurate. In many cases AIs will even add comments that only make sense as a response to my prompt and don't make any sense in-context.
What? Shaq shot free throws at 85% in practice. Players with good mechanics like Curry wouldn't be caught dead shooting 85% during the season, he's always 90%+ at free throws. Curry would probably shoot free throws at 99.9% in practice; there's plenty of stories of him swishing 100 3-pointers in a row in practice.
I'm not saying Shaq would shoot free throws as well as "90% in game and 99.9% in practice" if he threw them underhanded, but clearly Shaq had mechanics issues.
Say I wrote a specific comment why this fencepost here needs special consideration. The agent will come through and replace that reasoned comment with "Add one to index".
Most of the time when the LLM is misbehaving, it's my fault for leaving outdated instructions.
That doesn't mean you should skip it - but it's vital to recognize the costs.
When I joined my current company they had extensive documentation on several systems, all of it outdated, stale or even just straight up wrong. I wasted cumulative weeks depending on other programmers to have properly documented things.
It's still worth doing: but you _must_ continually pay the debt down.
Keeping documentation in a separate system - like a wiki - is an anti-pattern in most cases. It leads to documentation that nobody trusts (and hence nobody consults) because it inevitably falls out of sync with the system it is documenting.
Plus... LLMs are good enough now that having one automatically check PRs to warn if the change affects the existing documentation might actually work well enough to be useful.
I've been slow to adopt things. I know the cool kids are having agents do the doc changes and the code changes in the first place.
Edit: I'm basically repeating the poster who said it's a principal-agent problem.
When doling out bad docs to a stupid robot, you only have yourself to blame for the bad docs. So I think it's #2 + #3. The big change is replaceability going from bad to desirable (replace yourself with agents before you are replaced with a cheaper seat).
I think all three things you mention come into play, but it's a bit early to judge whether the world has shifted or whether this is mainly a result of everything being fresh and new.
In a proprietary system, there is pressure against creating quality technical documentation because it can be used to train your replacement. Writing docs solely for your own benefit, or your colleagues' benefit, is also dubious because you already know the things you wrote. Although returning to a thing you made months/years ago can be painful, it's not the day-to-day common case in enterprise software development.
AI assistants flip the incentives. Now, your docs are helping to steer your personal digital goblin in the right direction. The docs are tangibly augmenting your own abilities.
Arguably the same structural disincentives are in place
Personally I don't write documentation because I don't read documentation. It's too often pure garbage, less reliable than 2023 LLM output so I just skip it and go to the source code.
I would read documentation written for AI because I know for a fact that it describes the thing accurately enough if the system works. Human addressed documentation almost always has no verification.
I would just be a little cautious about this, for a few reasons: (a) an expectation of lots of examples and such can increase the friction to capturing anything at all; (b) this can encourage AI slop bloat that is wrong; (c) bloat increases friction to keeping the key info up to date.
> 3) Many programmers are egotists. Documentation that helps other people doesn't generate internal motivation. But documentation that allows you to better harness a computer minion to your will is attractive.
There are also people who are conflicted by non-ideal trust environment: they genuinely want to help the team and do what's right for the business, but they don't want to sacrifice themselves if management doesn't understand and value what they're doing.
> Any other theories?
Another reason is that organizations often had high-friction and/or low-trust places to put documentation.
I always emphasize low-friction, trusted engineering docs. Making that happen in a small company seems to involve getting everyone to use a low-friction wiki (and in-code/repo docs, and knowing when to use which), migrating all the docs strewn across the random SaaSes people dropped them into, and showing people how it's done.
It must be seen as genuinely valuable to team-oriented, mission-oriented people.
Side note: It's very difficult to un-teach someone all the "work" skills they learned in school and in many corporate jobs, where work is mostly directed by appearances. For example, the goal of a homework essay is to deliver something that looks like it will get a good grade from the grader; they don't care at all about its actual quality or value, and it has no other purpose. So, if you drop that person into a sprint with tasks assigned to them, their main goal will be to look good on what they think the metrics are, and they will have a hard time believing they're supposed to be thinking beyond that. (They might think it's just corporate platitudes that no one believes, like the Mission Statement, and nod their head until you go away.) And if they're told they're required to "document", the same people will go into that homework mode, will love the generative-AI tools, and won't reason about the quality, value, or counterproductiveness of dumping that write-only output into whatever enterprise SaaS someone bought (a purchase that was often itself another example of "work" done for appearances, without really understanding or caring).
I would love to be able to share our internal "all the things that are wrong with our approach to documentation" wiki page. It's longer than you could possibly imagine, probably more than 15 years old at this point, and filled to the brim with sarcasm and despair. It's so fucking funny. The table of contents is several pages long.
> Where to get US census data from and how to understand its structure
Reminds me of my first time using Wolfram Alpha and got blown away by its ability to use actual structured tools to solve the problem, compared to normal search engine.
In fact, I tried again just now and am still amazed: https://www.wolframalpha.com/input?i=what%27s+the+total+popu...
I think my mental model for Skills would be Wolfram Alpha with custom extensions.
Funnily enough, this was the result: `6.1% mod 3 °F (degrees Fahrenheit) (2015-2019 American Community Survey 5-year estimates)`
I wonder how that was calculated...
If you mean that it all breaks down to if/else at some level then, yeah, but that goes for LLMs too. LLMs aren't the quantum leap people seem to think they are.
The whole point of algorithmic AI was that it was deterministic and - if the algorithm was correct - reliable.
I don't think anyone expected that soft/statistical linguistic/dimensional reasoning would be used as a substitute for hard logic.
It has its uses, but it's still a poor fit for many problems.
We're still at the stage of eating pizza for the first time. It'll take a little while to remember that you can do other things with bread and wheat, or even other foods entirely.
Lisp was the AI language until the first AI Winter took place, and also took Prolog alongside it.
Wolfram Alpha basically builds on them, to put in a very simplistic way.
Doesn't need the craziest math capability but standard symbolic math stuff like expression reduction, differentiation and integration of common equations, plotting, unit wrangling.
All with an easy to use text interface that doesn't require learning.
https://maxima.sourceforge.io/
I used it when it was called Macsyma running on TOPS-20 (and a PDP-10 / Decsystem-20).
Text interface will require a little learning, but not much.
- Mathematica
- Maple
- MathStudio (mobile)
- Ti-89 calculator (high school favorite)
Others:
- SageMath
- GNU Octave
- SymPy
- Maxima
- Mathcad
We only call it AI until we understand it.
Once we understand LLMs more and there's a new promising poorly understood technology, we'll call our current AI something more computer sciency
Tool calling was a thing before MCP, but the models weren't very good at it. MCP almost exactly coincided with the models getting good enough at tool calling for it to be interesting.
So yeah, I agree - most of the MCP excitement was people learning that LLMs can call tools to interact with other systems.
The exact same is true of these Claude Skills. Technically this is “just a system prompt and some tools”, but it’s actually about LLM labs intentionally encoding specific frameworks of action into the models.
Same stuff, different name - only thing that's changed is that Anthropic got people to agree on RPC protocol.
It's not like it's a new idea, either. MCP isn't much different from SOAP or DCOM - but it works where the older approaches didn't, because LLMs are able to understand API definitions and natural-language documentation, and then map between those APIs (and user input) on the fly.
No, tool calls are just one of many MCP parts. People thinking MCP = SOAP or DCOM or JSON-RPC or OpenAPI didn't stop for 20 minutes to read and understand MCP.
Tool calls are 20% of MCP, at maximum. And a good amount of that is dynamically generating the tool list exposed to LLMs. But lots of people here think MCP === give the model 50 tools to choose from.
What else is there? I know about resources and prompts but I've seen almost no evidence of people actually using them, as far as I can tell tools are 90% of the usage of MCP, if not more.
> I know about resources and prompts but I've seen almost no evidence of people actually using them
these are features that MCP clients should implement and unfortunately, most of them still don't. The same for elicitation and sampling. Prompts, for example, are mostly useful when you use sampling, then you can create an agent from an MCP server.
But you can do all that yourself if you're building your own agent and directly calling a model. But if you want to be able to provide behavior to any agent, that's where MCP comes in.
For example, A resource can be just a getter tool, like getFile.
It's nice to have an open standard though. In that sense it's pretty awesome.
But MCP isn't just tools, you can expose prompt templates and context resources as well.
All the skills that don't have an added dependency on a local script could just be an MCP resource.
So everything else is just adding behavior around it.
MCP is a way to add behavior around LLM prompting for user convenience.
Before that you had to install each cli you wanted and it would invariably be doing some auth thing under the covers.
Tool calling was certainly the big LLM advantage, but "hey, tools should probably auth correctly" is pretty valuable.
Not sure what skills adds here other than more meat for influencers to 10x their 10xed agent workflows. 100x productivity what a time to be alive
If we're considering primarily coding workflows and CLI-based agents like Claude Code, I think it's true that CLI tools can provide a ton of value. But once we go beyond that to other roles - e.g., CRM work, sales, support, operations, finance - MCP-based tools are going to have a better form factor.
I think Skills go hand-in-hand with MCPs, it's not a competition between the two and they have different purposes.
I am interested, though: when the Python code in Skills can call MCPs directly via the interpreter... that is the big unlock (something we have tried and found to work really well).
You can drive one or two MCPs off a model that happily runs on a laptop (or even a phone). I wouldn't trust those models to go read a file and then successfully make a bunch of curl requests!
We're also at the point where LLMs can generate MCP servers, so you can pretty much generate completely new functionality with ease.
As is often the case, every product team is told that MCP is the hot new thing and they have to create an MCP server for their customers. And I've seen that customers do indeed ask for these things, because they all have initiatives to utilize more AI. The customers don't know what they want, just that it should be AI. The product teams know they need AI, but don't see any meaningful ways to bring it into the product. But then MCP falls on their laps as a quick way to say "we're an AI product" without actually having to become an AI product.
Agentic LLMs are, in a way, an attempt to commoditize entire service classes, across the board, all at once.
Personally, I welcome it. I keep saying that a lot of successful SaaS products would be much more useful and ergonomic for end users if, instead of webshit SPA, they were distributed as Excel sheets. To that I will now add: there's a lot more web services that I'd prefer be tool calls for LLMs.
Search engines have already been turned into features (why ask Google when o3 can ask it for me), but that's just an obvious case. E-mails, e-commerce, shopping, coding, creating digital art, planning, managing projects and organizations, analyzing data and trends - all those are in-scope too; everything I can imagine asking someone else to do for me is meant to eventually become a set of tool calls.
Or in short: I don't want AI in your product - I want AI of my choice to use your product for me, so I don't have to deal with your bullshit.
There’s a fundamental misalignment of incentives between publishers and consumers of MCP.
Asking for snacks would activate Klarna for "mario themed snacks", and even the most benign request would become a plug for the Mario movie
https://chatgpt.com/s/t_68f2a21df1888191ab3ddb691ec93d3a
Found my favorite for John Wick, question was "What is 1+1": https://chatgpt.com/s/t_68f2bc7f04988191b05806f3711ea517
But MCP has at least 2 advantages over cli tools
- Tool calling LLM combined w/ structured output is easier to implement as MCP than CLI for complex interactions IMO.
- It is more natural to hold state between tool calls in an MCP server than with a CLI.
When I read the OT, I initially wondered if I indeed bought into the hype. But then I realized that the small demo I built recently to learn about MCP (https://github.com/cournape/text2synth) would have been more difficult to build as a cli. And I think the demo is representative of neat usages of MCP.
- bundled instructions, covering complex interactions ("use the id from the search here to retrieve a record") for non-standard tools
- custom MCPs, the ones that are firewalled from the internet, for your business apis that no model knows about
- centralized MCP services, http/sse transport. Give the entire team one endpoint (ie web search), control the team's official AI tooling, no api-key proliferation
Now, these trivial `npx ls-mcp` stdio ones, "ls files in any folder" MCPs all over the web are complete context-stuffing bullshit.
The former is a step function change. The latter is just a small improvement.
Supabase MCP really devours your context window. IIRC, it uses 8k for its search_docs tool alone, just on load. If you actually use search_docs, it can return >30k tokens in a single reply.
Workaround: I just noticed yesterday that Supabase MCP now allows you to choose which tools are available. You can turn off the docs, and other tools. [0]
If you are wondering why you should care, all models get dumber as the context length increases. This happens much faster than I had expected. [1]
It's just instructions with RAG. The more I read about this the more convinced I am that this is just marketing.
That's why it's common advice to turn off MCPs for tools you don't think are relevant to the task at hand.
The idea behind skills is that they're progressively unlocked: they only take up a short description in the context, relying on the agent to expand things if it feels it's relevant.
Pretty early on folks recognized that most MCPs can just be CLI commands, and a markdown file is fine for describing them. So Claude Code users have markdown files of CLI calls and mini tutorials on how to do things. The 'how to do things' part seems to be what we're now calling skills... Which we're still writing in markdown and using from Claude.
Is the new thing that Claude will match & add them to your context automatically vs you call them manually? And that's a breakthrough because there's some emergent behavior?
The only material difference with skills is that Claude knows to scan them for YAML descriptions on startup, which means it can trigger them by itself more easily.
More mature CLAUDE.md files already typically index into other files, including guidance on which to preload vs lazy-load. However, in practice, Claude forgets quite easily, so that pattern is janky. A structured mechanism helps guarantee that Claude forgets less.
Forward looking, from an automation perspective of autonomous learning, this also makes it more accessible to talk about GEPA-for-everyone to maintain & generate these. We've been playing with similar flows in louie.ai, and came to a similar "just make it folders full of markdown with some learning automation options."
I was guessing that was what was going on here, but the writeup felt like maybe more was being said :) (And thank you for continuing to write!)
I don't get what's so special about Claude doing this?
Anthropic also realized that this pattern solves one of the persistent problems with coding agents: context pollution. You need to stuff as little material as possible into the context to enable the tool to get things done. AGENTS.md and MCP both put too much stuff in there - the skills pattern is a much better fit.
MCP was conceptually quite complicated, and a pretty big lift in terms of implementation for both servers and clients.
Skills are conceptually trivial, and implementing them is easy... provided you have a full Linux-style sandbox environment up and running already. That's a big dependency, but it's also an astonishingly powerful way to use LLMs based on my past 6 months of exploration.
I'm also worried about Claude Code making a mistake and doing something like deleting stuff that I didn't want deleted from folders outside of my direct project.
But even if LoRA isn't it - the point is that "skill" seems like the wrong term for something that already has a name: instructions. These are instruct-tuned models. Given instructions they can do new things; this push to rebrand it as a "skill" just seems like marketing.
1. By giving this pattern a name, people can have higher level conversations about it.
2. There is a small amount of new software here. Claude Code and https://claude.ai/ both now scan their skills/ folders on startup and extract a short piece of metadata about each skill from the YAML at the top of those markdown files. They then know that if the user e.g. says they want to create a PDF they should "cat skills/pdf/skill.md" first before proceeding with the task.
3. This is a new standard for distributing skills, which are sometimes just a markdown file but can also be a folder with a markdown file and one or more additional scripts or reference documents. The example skills here should help illustrate that: https://github.com/anthropics/skills/tree/main/document-skil... and https://github.com/anthropics/skills/tree/main/artifacts-bui...
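To make points 2 and 3 concrete, here's a rough sketch of what a skill's main file might look like (the topic, wording, and helper script are invented for illustration, and the exact frontmatter fields may differ slightly from Anthropic's spec):

```markdown
---
name: us-census-data
description: Use when the user asks for US census or American Community Survey
  figures. Covers where to download the data and how its tables are structured.
---

# Working with US census data

- Download ACS 5-year estimate tables from data.census.gov.
- Table B01003 holds total population estimates by geography.
- For anything beyond a simple lookup, run the bundled helper script
  (hypothetical): `python scripts/fetch_acs.py --table B01003 --state CA`
```

At startup the harness only needs the few dozen tokens of frontmatter; the body, and any scripts bundled alongside it, are read on demand.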
I think the pattern itself is really neat, because it's an acknowledgement that a great way to give an LLM system additional "skills" is to describe them in a markdown file packaged alongside some relevant scripts.
It's also pleasantly vendor-neutral: other tools like Codex CLI can use these skills already (just tell them to go read skills/pdfs/skill.md and follow those instructions) and I expect they may well add formal support in the future, if this takes off as I expect it will.
Subagents are mainly a token context optimization hack. They're a way for Claude Code to run a bunch of extra tools calls (e.g. to investigate the source of a bug) without consuming many tokens in the parent agent loop - the subagent gets its own loop, can use up to ~240,000 tokens exploring a problem and can then reply back up to the parent agent with a short description of what it did or what it figured out.
A subagent might use one or more skills as part of running.
A skill might advise Claude Code on how best to use subagents to solve a problem.
A good use case is Cognition/Windsurf swe-grep which has its own model to grep code fast.
I was inspired by it but too bad it’s closed for now, so I’m taking a stab with an open version https://github.com/aperoc/op-grep.
If you told the median user of these services to set one of these up I think they would (correctly) look at you like you had two heads.
People want to log in to an account, tell the thing to do something, and the system figures out the rest.
MCP, Apps, Skills, Gems - all this stuff seems to be tackling the wrong problem. It reminds me of those youtube channels that every 6 months say "This new programming language, framework, database, etc is the killer one", they make some todo app, then they post the same video with a new language completely forgetting they've done this already 6 times.
There is a lot of surface-level iteration, but deep problems aren't being solved. Something in tech went very wrong at some point, and as soon as money men flood the field we get announcements like this. Push out the next release, get my promo, jump to the next shiny tech company, leaving nothing in their wake.
There is no problem to solve. These days, solutions come in a package which includes the problems they intend to solve. You open the package. Now you have a problem that jumped out of the package and starts staring at you. The solution comes out of the package and chases the problem around the room.
You are now technologically a more progressed human.
And the problem being solved is, LLMs are universal interfaces. They can understand[0] what I mean, and they understand what those various "solutions" are, and they can map between them and myself on the fly. They abstract services away.
The businesses will eventually remember that the whole point of marketing is to prevent exactly that from happening.
--
[0] - To a degree, and conditioned on what one considers "understanding", but still - it's the first kind of computer systems that can do this, becoming a viable alternative to asking a human.
My fairly negative take on all of this has been that we’re writing more docs, creating more apis and generally doing a lot of work to make the AI work, that would’ve yielded the same results if we did it for people in the first place. Half my life has been spent trying to debug issues in complex systems that do not have those available.
I just took GPT-5: output is $10 per million tokens. Let's double the cost to account for input tokens ($1.25 per million, or $0.125 if cached).
For 1 million tokens, it would take a 40 wpm typist around 20K minutes to output that $20 worth of text. That is just typing. So about 300 hours of non-stop effort for that $20.
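As a rough sanity check of that arithmetic (the words-per-token ratio below is the usual rule of thumb, not an exact figure):

```python
# Back-of-the-envelope: how long would a 40 wpm typist need to type
# the ~1 million tokens that roughly $20 of GPT-5 output buys?
tokens = 1_000_000
words = tokens * 0.75        # rule of thumb: ~0.75 words per token
wpm = 40                     # typing speed of the hypothetical typist
minutes = words / wpm        # ~18,750 minutes
hours = minutes / 60         # ~312 hours
print(f"{minutes:,.0f} minutes, about {hours:,.0f} hours of non-stop typing")
```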
So even if you say.. oh.. the real price is $100 not $20. The value changes are still shattering to the previous economic dynamics.
Then layer in that also as part of that value, the "typist" is also more skilled than the average working person in linguistics, software engineering, etc. Then that value is further magnified.
This is why I say we have only begun to barely see the disruption this will cause. Even if the models don't get better or cheaper, the potential impact is hard to grasp.
The counter-argument is that code is the only way to concisely and unambiguously express how everything should work.
One way (and one use case) of looking at it is, LLM agents with access ("tools") to semantic search[0] are basically a search engine that understands the text it's searching through... and then can do a hundred different things with it. I found myself writing better notes at work for this very reason - because I know the LLM can see them, and can do anything from surfacing obscure insights from the past, to writing code to solve an issue I documented earlier.
It makes notes no longer be write-only.
--
[0] - Which, incidentally, is itself enabled by LLM embeddings.
Haha, just kidding you tech bros, AI's still for you, and this time you'll get to shove the nerds into a locker for sure. ;-)
Programming was always a tool for humans. It’s a formal “notation” for describing solutions that can be computed. We don’t do well with bit soup. So we put a lot of deterministic translations between that and the notation that we’re good with.
Not having to do programming would be like not having to write sheet music because we can drop a cat from a specific height onto a grand piano and have the correct chord come out. Code is ideas precisely formulated while prompts are half formed wishes and prayers.
I’m attracted to this theory in part because it applies to me. I’m a below average coder (mostly due to inability to focus on it full time) and I’m exceptionally good at clear technical writing, having made a living off it much of my life.
The present moment has been utterly life changing.
And then you have something like the LLM craze where while it’s new, it’s not improving any part of the problem it’s supposed to solve, but instead is creating new ones. People are creating imperfect solutions to those new problems, forgetting the main problem in the process. It’s all vapourware. Even something like a new linter for C is more of a solution to programmer’s productivity than these “skills”
I don't see how this is bad. Technology makes iterative, marginal improvements over time. Someone may make a video tomorrow claiming a great new frontend framework, even though they made that exact video about Nextjs, or React before that, or Angular, or JQuery, or PHP, or HTML.
>Something in tech went very wrong at some point, and as soon as money men flood the field we get announcments like this
If it weren't for the massive money being poured into AI, we'd be stuck with GPT-3 and Claude 2. Sure, they release some duds in the tooling department (although I think Skills are good, actually) but it's hardly worthy of this systemic rot diagnosis you've given.
> People want to log in to an account, tell the thing to do something, and the system figures out the rest.
At a glance, this seems to be a practical approach to building up a personalized prompting stack based on the things I commonly do.
I’m excited about it.
It might be superficial but it's still state of the art.
If agentic coding of good quality becomes too cheap to meter, all that is left are the deep problems.
What is the "real problem"?
In the pursuit of making application development more productive, they ARE solving real problems with mcp servers, skills, custom prompts, etc...
The problems are context dilution, tool usage, and awareness outside of the llm model.
This is accidental complexity. You've already decided on a method, and instead of solving the main problem, you are solving the problems associated with the method. Like deciding to go to space in a car and strapping a rocket onto it.
For consumers, yes. In B2B scenarios more complexity is normal.
As the old adage goes: "Don't hate the player, hate the game?"
To actually respond: this isn't for the median user. This is for the 1% user to set up useful tools to sell to the median user.
If I had to guess, it would be because greed is a very powerful motivator.
> As the old adage goes: "Don't hate the player, hate the game?"
I know this advice is a realistic way of getting ahead in the world, but it's very disheartening and long term damaging. Like eating junk food every day of your life.
BTW, before even MCP was a thing we invented our own system that is called Skillset. Turns out now it is sort of the best parts of both MCPs and Skills.
I made a skill with the unambiguous description: "Use when creating or editing bash scripts"
Yet, Claude does not invoke the skill when asked to write a bash script.
https://gist.github.com/raine/528f97375e125cf97a8f8b415bfd80...
Maybe it messed that up because writing bash scripts is so core to how Claude Code works? Much of the existing system prompt (and I bet a lot of the fine-tuning data) is about how to use the Bash tool.
description: CRITICAL: Use when writing bash scripts
Surprisingly, no effect either. I would've thought adding "CRITICAL" would somehow promote that instruction in the sea of context.

> "Skills work through progressive disclosure—Claude determines which Skills are relevant and loads the information it needs to complete that task, helping to prevent context window overload."
So yeah, I guess you're right. Instead of one humongous AGENTS.md, just packaging small relevant pieces together with simple tools.
I also think "skills" is a bad name. I guess its a reference to the fact that it can run scripts you provide, but the announcement really seems to be more about the hierarchical docs. It's really more like a selective context loading system than a "skill".
Over time I would systematically create separate specialized docs around certain topics and link them in my CLAUDE.md file, but notably without using the "@" symbol, which to my understanding always causes Claude to ingest the linked files, unnecessarily bloating your prompt context.
So my CLAUDE.md file would have a header section like this:
# Documentation References
- When adding CSS, refer to: docs/ADDING_CSS.md
- When adding or incorporating images, refer to: docs/ADDING_IMAGES.md
- When persisting data for the user, refer to: docs/STORAGE_MANAGER.md
- When adding logging information, refer to: docs/LOGGER.md
It seems like this is less of a breakthrough and more an iterative improvement towards formalizing this process from an organizational perspective.

When this documentation is read, please output "** LOGGING DOCS READ **" to the console.
These days I do find that the TOC approach works pretty well, though I'll probably swap them over to Skills to see if the official equivalent works better. [1]

[1] https://www.anthropic.com/engineering/a-postmortem-of-three-...
What bugs me: if we're optimizing for LLM efficiency, we should use structured schemas like JSON. I understand the thinking about Markdown being a happy medium between human/computer understanding but Markdown is non-deterministic for parsing. Highly structured data would be more reliable for programmatic consumption while still being readable.
Search and this document base pattern are different. In search the model uses a keyword to retrieve results, here the model starts from a map of information, and navigates it. This means it could potentially keep context better, because search tools have issues with information fragmentation and not seeing the big picture.
https://github.com/anthropics/skills/blob/main/document-skil...
There are many edge cases when writing / reading Excel files with Python and this nails many of them.
*I use a TUI to manage the context.
Note that they don't actually suggest that the XML needs to be VALID!
My guess was that JSON requires more characters to be escaped than XML-ish syntax does, plus matching opening and closing tags makes it a little easier for the LLM not to lose track of which string corresponds to which key.
<instructions>
...
...
</instructions>
can be much easier than
{
"instructions": "..\n...\n"
}
especially when there are newlines, quotes and unicode
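A quick way to see the escaping difference (a minimal sketch; actual token counts depend on the tokenizer):

```python
import json

text = 'Line one\nLine two with "quotes" and a backslash \\'

# JSON must escape the newline, the quotes and the backslash inside the value...
print(json.dumps({"instructions": text}))
# {"instructions": "Line one\nLine two with \"quotes\" and a backslash \\"}

# ...while an XML-ish wrapper carries the text essentially verbatim
# (and, per the earlier point, the LLM doesn't care whether it's valid XML).
print(f"<instructions>\n{text}\n</instructions>")
```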
I would suspect that a single attention layer won't be able to figure out which token an opening-bracket token should attend to most. Think of {"x": {y: 1}}: with only one layer of attention, can the token for the first opening bracket successfully attend to exactly the matching closing bracket?
I wonder if RNNs work better with JSON or XML. Or maybe they are just fine with both of them because a RNN can have some stack-like internal state that can match brackets?
Probably, it would be a really cool research direction to measure how well Transformer-Mamba hybrid models like Jamba perform on structured input/output formats like JSON and XML and compare them. For the LLM era, I could only find papers that do this evaluation with transformer-based LLMs. Damn, I'd love to work at a place that does this kind of research, but guess I'm stuck with my current boring job now :D Born to do cutting-edge research, forced to write CRUD apps with some "AI sprinkled in". Anyone hiring here?
Browser engines could've been simpler; web development tools could've been more robust and powerful much earlier; we would be able to rely on XSLT and invent other ways of processing and consuming web content; we would have proper XHTML modules, instead of the half-baked Web Components we have today. Etc.
Instead, we got standards built on poorly specified conventions, and we still have to rely on 3rd-party frameworks to build anything beyond a toy web site.
Stricter web documents wouldn't have fixed all our problems, but they would have certainly made a big impact for the better.
<title>This & that</title>
<author>Simon</author>
<body>Article content goes here</body>
If you ask an LLM for the title, author and body it will give you the right answer, even though that is not a valid XML document.

Just look at HTML vs XHTML.
MCP gives the LLM access to your APIs. These skills are just text files with context about how to perform specific tasks.
Depends on who the user is...
A difference/advantage of MCP is that it can be completely server-side. Which means that an average person can "install" MCP tools into their desktop or Web app by pointing it to a remote MCP server. This person doesn't want to install and manage skills files locally. And they definitely don't want to run python scripts locally or run a sandbox vm.
Now, whether they're able to convert that house of cards into a solid foundation or it eventually spectacularly falls over will have to be seen over the next decade.
That's the pull quote right there.
I really enjoyed seeing Microsoft Amplifier last week, which similarly has a bank of different specialized sub-agents. These other banks of markdowns that get turned on for special purposes feels very similar. https://github.com/microsoft/amplifier?tab=readme-ov-file#sp... https://news.ycombinator.com/item?id=45549848
One of the major twists with Skills seems to be that Skills also have a "frontmatter YAML" that is always loaded. It still sounds like it's at least somewhat up to the user to engage the Skills, but this "frontmatter" offers… something, that purports to help.
> There’s one extra detail that makes this a feature, not just a bunch of files on disk. At the start of a session Claude’s various harnesses can scan all available skill files and read a short explanation for each one from the frontmatter YAML in the Markdown file. This is very token efficient: each skill only takes up a few dozen extra tokens, with the full details only loaded in should the user request a task that the skill can help solve.
I'm not sure what exactly this does but conceptually it sounds smart to have a top level awareness of the specializations available.
I do feel like I could be missing some significant aspects of this. But the mod-launched paradigm feels like a fairly close parallel?
I'm still getting this set up, so I'm not sure yet if it'll lead to better outcomes. I'd say, there's reason to believe it might, but there's also reasons to believe it won't.
If the library is like 50,000 lines of code long, thousands of types, hundreds of helper functions, CC could just look into the node_modules folder and bundle all of this into its context. But, this might not be feasible, or be expensive; so the SKILL.md distills things down to help it get a high level understanding faster.
However, the flip side of that is: What if it's too general? What if CC needs specific implementation details about one specific function? Is this actually better than CC engaging in a two-step process of (1) looking at node_modules/my-lib/index.ts or README.md to get that high level understanding, then (2) looking at node_modules/my-lib/specificFunction.ts to get the specific intel it needs? What value did the SKILL.md actually convey?
My feeling is that this concept of "human-specified context-specific skills" would convey the most value in situations where the model itself is constrained. E.g. you're working with a smaller open source model that doesn't have as comprehensive intrinsic knowledge of some library's surface, or doesn't have the context windows of larger models. But, for the larger models... its probably just better to rely on in-built model knowledge and code-as-context.
Intuitively it feels like if you need to look at the implementation to understand the library then the library is probably not well documented/structured.
I think the ability to look into the code should exist but shouldn't be necessary for the majority of use cases
Furthermore, with all the hype around MCP servers and simply the number of servers now existing, do they just immediately become obsolete? It's also a bit fuzzy to me exactly how an LLM will choose an MCP tool over a skill and vice versa...
If you're running an MCP server just to expose local filesystem resources, then it's probably obsolete. But skills don't cover a lot of the functionality that MCP offers.
But aren't the tool definitions and skill definitions in different places? How do you express the dependency? Can skills say they require command line access, python, tool A, and tool B, and when you load the skill it sets those as available tool calls?
These skills also rely on tools, having a standard way to add tools to an agent is good, otherwise each agent has its own siloed tools.
But also, I remember MCP having support for resources no? These skills are just context (though I guess it can include executable scripts to help, but the article said most skills are just an instruction markdown).
So you could already have an MCP expose skills as resources, and you could already have the model automatically decide to include a resource based on the resource description.
Now I understand to add user created resources is pretty annoying, and maybe it's not great for people to easily exchange themselves resources. But you assume that Slack would make the best context to generate Slack gifs, and then expose that as a resource from their MCP along with a prompt template and some tools to help or to add the gif to your slack as emojis or what not.
You could even add Skills specifically to MCP, so that you can expose a combination of context resources and scripts or something.
That said, I agree that the overabundance of tools as MCPs is not that good; some tools are so powerful they can cover 90% of all other tool use cases. The Bash tool can do so many things. A generic web browsing tool as well. That's been the problem with MCP as tools.
Skills appear to be a good technique as a user, and I actually already did similar things. I like formalizing it, and it's nice that Claude Code now automatically scans and includes their description header for the model to know it can load the rest. That's the exciting part.
But I do feel for the more general public, MCP resources + prompts + tools are a better avenue.
Similarly, my experience writing and working with MCPs has been quite underwhelming. It takes too long to write them and the workflow is kludgy. I hope Skills get adopted by other model vendors, as it feels like a much lighter way to save and checkout my prompts.
But I suppose yeah, why not just write clis and have an llm call them
- Writing manifests and schemas by hand takes too long for small or iterative tools. Even minor schema changes often require re-registration or manual syncing. There’s no good “just run this script and expose it” path yet.
- Running and testing an MCP locally is awkward. You don’t get fast iteration loops or rich error messages. When something fails, the debugging surface is too opaque - you end up guessing what part broke (manifest, transport, or tool logic).
- There’s no consistent registry, versioning, or discovery story. Sharing or updating MCPs across environments feels ad hoc, and you often have to wire everything manually each time.
With Skills you need none of them - instruct to invoke a tool and be done with it.
yes there is:
https://github.com/modelcontextprotocol/registry
and here you have frontends for the registry https://github.com/modelcontextprotocol/registry/blob/main/d...
Everything is new so we are all building it in real time. This used to be the most fun times for a developer: new tech, everybody excited, lots of new startups taking advantage of new platforms/protocols.
In the same vein, tools are also easy to iterate on, they are quite simple to implement, and their description is entirely up to you - you can limit how many tokens this description will consume.
This feels like going in the wrong path in a way, but we'll see how this evolves and what are the use cases.
If I learned how to say "hello" in French today and also found out I have stage 4 brain cancer, they are completely different things but one is a bigger deal than the other.
MCP lets agents do stuff. Skills let agents do stuff. There's the overlap.
As someone who is looking into MCP right now, I'd love to hear what folks with experience in both of these areas think.
My first impressions are that MCP has some advantages:
- around for longer and has some momentum
- doesn't require a dev envt on the computer to be effective
- cross-vendor support
- more sophistication for complex use cases (enterprise permissions can be layered on because of OAuth support)
- multiple transport layers gives flexibility
Skills seems to have advantages too, of course:
- simpler
- easier to iterate
- less context used
I think if the other vendors follow along with skills, and we expect every computer to have access to a development environment, skills could win the day. HTML won over XML and REST won over SOAP, so simple often wins.
But the biggest drawback of MCP, the context window overuse, can be remediated by having MCP specific sub-agents that are interacted with using a primary agent, rather than injecting each MCP server into the main context.
I still plan to ship an MCP for one of my products to let it interact with the wider ecosystem, but as an end-user I'm going to continue mostly using Claude Code without them.
I really don't see why we need two forms of RPC...
Take a look at this one for working with PDFs for example: https://github.com/anthropics/skills/blob/main/document-skil... - it includes a quickstart guide to using the Python pypdf module, then expands on that with some useful scripts for common patterns.
The problem skills solve is initial mapping of available information. A tool might hide what information it contains until used, this approach puts a table of contents for docs in the context, so the model is aware and can navigate to desired information as needed.
Basically the way it would work is, in the next model, it would avoid role playing type instructions, unless they come from skill files, and internally they would keep track of how often users changed skill files, and it would be a TOS violation to change it too often.
Though I gave up on Anthropic in terms of true AI alignment long ago, I know they are working on a trivial sort of alignment where it prevents it from being useful for pen testers for example.
A few other things are missing too, like versioning and access control for tools.
Real-world skills come not just from practice; they are also opinionated workflows built with specific toolchains.
IMO, this is half-assed engineering at the moment.
I get it no one is using that, but like this just sounds like a rehash?
https://modelcontextprotocol.io/specification/2025-06-18/ser... https://modelcontextprotocol.io/specification/2025-06-18/ser...
I do not understand this. cli-tool --help output still occupies tokens, right?
• Linear: 23 tools (~12,935 tokens)
• JetBrains: 20 tools (~12,252 tokens)
• Playwright: 21 tools (~9,804 tokens)
Does anybody have a good SKILLS.md file we can study?
Where are Claude Skills better than MCP? It's easier to produce a Claude Skill. It's just text. Everyone can write it. But it's dependent on the environment a lot. E.g. when you need certain tools available for it to work, how do you automate sandbox setup for that? Even then, are you sure it's the right version for it to use, etc.?
I was considering the merits of a news article analysis skill that processed the content adopting a variety of personas, opinions, and degrees of adversarial attitude in isolated contexts, then brought the responses together into a single context to arbitrate the viewpoints impartially.
Or even a skill for "where did this claim originate?". I'd love an auto Snopes skill.
And, this is why I usually use simple system prompts/direct chat for "heavy" problems/development that require reasoning. The context bloat is getting pretty nutty, and is definitely detrimental to performance.
The question is whether the analysis of all the Skill descriptions is faster or slower than just rewriting the code from scratch each time. Would it be a good or bad thing if an agent created thousands of slightly varied skills?
MCP is about integration of external systems and services. Skills are about context management - providing context on demand.
As Simon mentions, one issue with MCP is token use. Skills seem like a straightforward way to manage that problem: just put the MCP tools list inside a skill where they use no tokens until required.
RAG was originally about adding extra information to the context so that an LLM could answer questions that needed that extra context.
On that basis I guess you could call skills a form of RAG, but honestly at that point the entire field of "context engineering" can be classified as RAG too.
Maybe RAG as a term is obsolete now, since it really just describes how we use LLMs in 2025.
Calling the skill system itself RAG is a bit of a stretch IMO, unless you end up with so many skills that their summaries can’t fit in the context and you have to search through them instead. ;)
I think vector search has shown to be a whole lot more expensive than regular FTS or even grep, so these days a search tool for the model which uses FTS or grep/rg or vectors or a combination of those is the way to go.
Sure, it's tedious, but I get to directly observe every change.
I guess I just don't feel comfortable with more black box magic beyond the LLM itself.
So you only use the chat UI and copy and paste from there?
Or do you use CC but don’t let it automatically update files?
https://ampcode.com/news/toolboxes
Those are nice too — a much more hackable way of building simple personal tools than MCP, with less token and network use.
Eg I don't know where to put a skill that can be used across all projects
You can drop the new markdown files directly into your ~/.claude/skills directory.
Which kind of sounds pointless: if Claude already knows what to do, why create a document?
My examples - I interact with ElasticSearch and Claude keeps forgetting it is version 5.2 and we need to use the appropriate REST API. So I got it to create a SKILL.md about what we used and provided examples.
And the next one was getting it to write instructions on how to use ImageMagick on Windows, with examples and troubleshooting, rather than it trying to use the Linux versions over and over.
Skills are the solution to the problems I have been having. And they came just at the right time, as I had already spent half of last week making similar documents!
I hate how we are focusing on just adding more information to look up maps, instead of focusing on deriving those maps from scratch.
I don't mean to be unreasonable, but this is all about managing context in a heavy and highly technical manner. Eventually models must be able to augment their training / weights on the fly, customizing themselves to our needs and workflow. Once that happens (it will be a really big deal), all of the time you've spent messing around with context management tools and procedures will be obsolete. It's still good to have fundamental understanding though!
Rather than defining skills and execution agents, let a meta-planning agent determine the best path based on objectives.
> Perfect! I've created your Slack GIF!
> [Files hidden in shared chats]
Claude Skills
Wow, it hasn't even been a day, and a bold declaration.
how are skills different from SlashCommand tool in claude-code then?
1. A skill for refactoring code using ast-grep
2. A skill for searching code using my Symbex tool
3. A skill for building Datasette plugins
None of them feel quite good enough to share yet, I'm still exploring what patterns work the best.
Plus skill folders can also include additional reference documents and executable scripts, like this one here: https://github.com/anthropics/skills/tree/main/document-skil...
And if my disclosures aren't enough for you, here's the FTC explaining how it would be illegal for an AI vendor to pay someone to write something like this without both sides disclosing the relationship: https://www.ftc.gov/business-guidance/resources/ftcs-endorse...
You don't need money, just incentives and network effects
And yeah, blogging does kind of work on incentives. If I write things and get good conversations as a result, I'm incentivized to write more things. If I write something and get silence then maybe I won't invest as much time in the future.
I wrote about how good skills are... but pointed out the flaws in MCP at the same time. Anthropic have invested way more in MCP.
I called out Claude Haiku 4.5 as being more expensive than previous Haiku models when the thing I was hoping for was something that was price competitive with GPT-5 Mini/Nano and Gemini Flash Lite.
NVIDIA sent me a review unit of their Spark and I wrote about how hard it was to get CUDA and Arm working together.
OpenAI invited me to DevDay and I published a GPT-5 Pro pelican that took 6 minutes and cost $1.10, plus made fun of their terrible track record for announcing and then failing to ship revenue sharing on a livestream: https://www.youtube.com/live/M6paPiur4yQ?si=XXKkIKY2J71QCJKW...
The reason I get invited to stuff is that I'm a trusted independent voice in the space. The labs appear smart enough not to expect me to throw away my credibility for a free event ticket or early preview access to their launches.
More importantly: I don't value early access or event invitations very highly. If a lab stopped inviting me to stuff it really wouldn't affect me much at all. Might even help give me some space to focus on other things!
I don't particularly try to be unbiased because I don't think that's an achievable goal. What I aim for instead is honesty and truthfulness. I try very hard not to put false information out into the world, and when I do that I work hard to retract it - here's a recent example: https://simonwillison.net/2025/Oct/7/gemini-25-computer-use-...
I also take care to disclose things that could potentially influence my coverage, even if I don't personally think they influenced what I wrote.
What matters most to me is that I have an audience who finds me credible and trusts me not to mislead them, either accidentally or on purpose.
That's why I'm defensive against accusations of being a paid shill, which crop up on almost a weekly basis at this point.
Are we going to have a break out the dictionaries again?
Now how is it bias? Well, it's in your own answer. You are literally claiming things based on what you believe them to be, not based on what they likely are. For the latter, we'd need a longer timespan and more usage data. So you fiddled with it a bit over a timespan of a week or so and, based on this tiny sample, came to a conclusion that, as you say yourself, is your belief but not really hard data. That is quintessentially what a bias is. Or, to borrow from Merriam-Webster again: an inclination of temperament or outlook especially : a personal and sometimes unreasoned judgment : prejudice
I do have a relevant bias here I guess: I'm biased towards the pattern of granting an LLM the ability to execute commands in a Unix-style environment. I've been a huge fan of that approach ever since ChatGPT Code Interpreter launched in early 2023 and I'm excited that skills further solidifies why that pattern is such a good bet.
Can we prompt Claude to edit and improve its skills too?
I didn't say that MCP was too complex to take off - it clearly took off despite that complexity.
I'm predicting skills will take off even more. If I'm wrong feel free to call me out in a few months time for making a bad prediction!
I did not say you said exactly that either. Read more carefully. I said you were discrediting them by implying they were too complex to take off due to resource and complexity constraints. It's clearly stated in the relevant section of your post (https://simonwillison.net/2025/Oct/16/claude-skills/#skills-...)
>I'm predicting skills will take off even more. If I'm wrong feel free to call me out in a few months time for making a bad prediction!
How the hell can you predict they will "take off even more" when the feature is accessible for barely 24 hours at this point? You don't even have a basic referent frame or at least a statistical usage sample for making such a statement. That's not a very reliable prediction, is it?
That's what a prediction IS. If I waited until the feature had proven itself it wouldn't be much of a prediction.
The feature has also been live for more than 24 hours. I reverse-engineered it a week ago: https://simonwillison.net/2025/Oct/10/claude-skills/ - and it's been invisibly powering the PDF/DOC/XLS/PPT creation features on https://claude.ai/ since those launched on the 9th September: https://www.anthropic.com/news/create-files
No, that's merely guessing, mate. Predictions are, at least in modern meaning, based on at least some data and some extrapolation model that more or less reliably predicts the development of your known dataset into future (unknown) values. I don't see you presenting either in your post, so that's not predicting; that's in the best of cases guessing, and in the worst of cases irresponsible distribution of Anthropic's propaganda.
(If we stick to Merriam-Webster again, here is what I found: to calculate or predict (some future event or condition) usually as a result of study and analysis of available pertinent data - i.e. basically what I already told you a "prediction" is).
Oxford Learners Dictionary (because the Oxford English Dictionary is behind a paywall): "a statement that says what you think will happen; the act of making such a statement" https://www.oxfordlearnersdictionaries.com/us/definition/eng...
The current Wikipedia definition looks like a good fit for how I'm using the term here:
A prediction (Latin præ-, "before," and dictum, "something said"[1]) or forecast is a statement about a future event or about future data. Predictions are often, but not always, based upon experience or knowledge of forecasters. There is no universal agreement about the exact difference between "prediction" and "estimation"; different authors and disciplines ascribe different connotations.

https://en.wikipedia.org/wiki/Prediction