AI, build me a scraper
what do you want to scrape
[lists sites to scrape]
oh, I've already scraped those relentlessly, here ya go
Not fully equivalent to what Skyvern is doing, but still an interesting approach.
[1] https://www.reddit.com/r/LocalLLaMA/comments/1o8m0ti/we_buil...
Thanks for sharing!
That being said...
LLMs are amazing at some coding tasks and fail miserably at others. My hypothesis is that there is some sort of practical limit to how many concepts an LLM can take into account, no matter the context window, given current model architectures.
For a long time I wanted to find some sort of litmus test to measure this, and I think I found one: an easy-to-understand programming problem that fits in a single file, yet is complex enough. I have not found a single LLM able to build a solution without careful guidance.
I wrote more about this here if you are interested: https://chatbotkit.com/reflections/where-ai-coding-agents-go...
Plan for solving this problem:
- Build a comprehensive design system with AI models
- Catalogue the components it fails on (like yours)
- These components are the perfect test cases for hiring challenges (immune to “cheating” with AI)
- The answers to these hiring challenges can be used as training data for models
- Newer models can now solve these problems
- You can vary this by framework (web component / React / Vue / Svelte / etc.) or by version (React v18 vs React v19, etc.)
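The "failures become training data" step above could be sketched as a small pipeline that turns catalogued failure cases into supervised fine-tuning records. The field names and chat-message schema here are illustrative assumptions, not a real product's format:

```python
import json

# Hypothetical catalogue of components an LLM failed to build. Each entry
# pairs the hiring-challenge prompt with a human-written reference answer.
failures = [
    {
        "component": "virtualized-combobox",
        "framework": "React v19",
        "prompt": "Build an accessible combobox with windowed options.",
        "reference_solution": "/* human-written answer from a hiring challenge */",
    },
]

with open("training_data.jsonl", "w") as f:
    for case in failures:
        # One chat-style (prompt, completion) pair per failure case,
        # tagged by framework so the same component can be varied later.
        record = {
            "messages": [
                {"role": "user", "content": case["prompt"]},
                {"role": "assistant", "content": case["reference_solution"]},
            ],
            "tags": [case["framework"], case["component"]],
        }
        f.write(json.dumps(record) + "\n")
```

Varying the `framework` tag across React/Vue/Svelte versions of the same component is what would give you the Rosetta Stone effect mentioned below.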
What you’re doing with this is finding the exact contours of the edge of AI capability, then building a focused training dataset to push past those boundaries. Also a Rosetta Stone for translating between different frameworks.
I put a brain dump about the bigger picture this fits into here:
https://jim.dabell.name/articles/2025/08/08/autonomous-softw...
One can see the results in areas where most code is terrible but most people don't realise it (data science is where I see this most, as it's what I mostly do). I assume this also happens in areas like frontend, where I don't notice the badness because I'm not an expert.
The tricky part is that I don't think all programming is formal logic; only a small part is. And the fact that different code serves different purposes really screws up an LLM's reasoning process unless you make it very clear what each piece of code is for.
Why do you say this? The foundation of all of computer science is formal logic and symbolic logic.
There is a reason a lot of programmers see programming having lots of similarities with painting and other creative activities.
The space is just so large that everyone has their own "basis", which sometimes even moves with time. They can still be good programmers, imo.
Yes, but it also has to deal with "the real world", which is only logical if you can encode a near-infinite number of variables; instead, we create leaky abstractions in order to actually get work done.
My own solution? 1.56 seconds. I consider myself to be at an intermediate skill level, and while LLMs are useful, they likely won't replace any but the least talented programmers. Even then, I'd value a human with critical thinking paired with an LLM over an even more competent LLM.
Just for curiosity's sake, what language have you been trying to use?
this is not at all a sample of high-quality, well-educated-about-concurrency code, but it does roughly match a lot of Business™ code and also most less-mature open source code I encounter (which is most open source code). it's just not something most people are fluent with.
these same people using LLMs have generally produced much worse concurrent code, regardless of the model, their prompting sophistication, or thinking time, unless it's extremely trivial (then it's slightly better, because it's at least tutorial-level correct) (and yes, they should have just used one of many pre-existing libraries in these cases). doing anything beyond "5 workers on this queue plz" consistently ends up with major correctness flaws - often it works well enough while everything is running smoothly, but under pressure or in error cases it falls apart extremely badly... which is also true of most "how to write x concurrently" blog posts I run across - they're over-simplified to the point of being unusable in practice (e.g. by ignoring error handling) and far too inflexible to safely change for slightly different needs.
honestly I think it's mostly due to two things: a lack of quality training material (some obviously exists, but it's overwhelmed by flawed stuff), and an extreme sensitivity to subtle flaws (much more so than normal code). so it's both bad at generalizing (not enough transitional examples between targets), and its general inability to actually think introduces flaws that look like normal code, which humans are less likely to notice (due to their own lack of experience).
this is not to claim it's impossible to use them to write good concurrent code; there are many counter-examples showing it is possible. but it's a particularly error-prone area in practice, especially in languages without safer patterns or built-in verification.
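For what it's worth, even the trivial "5 workers on this queue" case only stays trivial if failures are handled; a minimal sketch of that pattern in Python, with the error handling the simplified blog examples tend to omit (`process` is a made-up stand-in for real work):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process(item):
    # Hypothetical unit of work; negative items simulate failures.
    if item < 0:
        raise ValueError(f"bad item: {item}")
    return item * 2

def run_pool(items, workers=5):
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, it): it for it in items}
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except Exception as exc:
                # A failed item must not silently kill the whole run,
                # and must not be silently dropped either.
                errors.append((futures[fut], exc))
    return results, errors

results, errors = run_pool([1, 2, -3, 4])
```

The point is only that "under pressure or in error cases" behavior has to be designed in from the start; bolting it onto a tutorial-level pool afterwards is where the subtle flaws creep in.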
And then the third or fourth time it's automatic. It's weird, but sometimes I feel like the best way to make agents work is to metathink about how I myself work.
You don’t get that whole uncanny valley disconnect do you?
LLMs don't do this. They can't think. If you just use one for like five minutes, it's obvious that just because the text on the screen says "Sorry, I made a mistake, there are actually 5 r's in strawberry", that doesn't mean there's any thought behind it.
If that's not thinking, then I don't know what is.
You can also take it a step further and add automatic fine-tuning once you start gathering a ton of data, which will rewire the model somewhat.
But in my mind, if I tell the LLM to do something, and it did it wrong, then I ask it to fix it, and if in the future I ask the same thing and it avoids the mistake it did first, then I'd say it had learned to avoid that same pitfall, although I know very well it hasn't "learned" like a human would, I just added it to the right place, but for all intents and purposes, it "learned" how to avoid the same mistake.
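The "added it to the right place" mechanism can be sketched very simply: corrections get appended to an instructions file that is prepended to every future prompt. The file name and format here are made up for illustration, not any particular tool's convention:

```python
# A minimal "lessons learned" memory: each correction is recorded once,
# then injected into all subsequent prompts so the same mistake is avoided.
LESSONS_FILE = "lessons.md"

def record_lesson(mistake, correction):
    with open(LESSONS_FILE, "a") as f:
        f.write(f"- Previously did: {mistake}. Instead: {correction}\n")

def build_prompt(task):
    try:
        lessons = open(LESSONS_FILE).read()
    except FileNotFoundError:
        lessons = ""  # no corrections recorded yet
    return f"Known pitfalls to avoid:\n{lessons}\nTask: {task}"

record_lesson("used blocking I/O in the event loop", "use async I/O")
prompt = build_prompt("fetch 100 URLs")
```

No weights change here, which is exactly the distinction being debated: the model hasn't learned anything, but the system around it behaves as if it had.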
The person is the data that they have ingested and trained on through the senses that are exposed by their body. Body is just an interface to reality.
What? LLMs don't think or learn in the sense humans do. They have absolutely no resemblance to a human being. This must be the most ridiculous statement I've read this year.
What used to be a constant almost daily chore with them breaking all the time at random intervals is now a self-healing system that rarely ever fails.
I know the authors of Skyvern are around here sometimes -- How do you think about code generation with vision based approaches to agentic browser use like OpenAI's Operator, Claude Computer Use and Magnitude?
From my POV, the vision-based approaches are superior, but they are less amenable to codegen.
We can ask the vision-based models to output why they are doing what they are doing, and fall back to code-based approaches for subsequent runs.
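That vision-first, code-second idea could be sketched as a simple cache: run the vision agent once, turn its explained action trace into a script, and replay the script on later runs. `vision_agent` and `generate_script` are hypothetical stand-ins, not Skyvern's actual API:

```python
import os

CACHE = "replay_script.py"

def run_task(url, task, vision_agent, generate_script):
    if os.path.exists(CACHE):
        # Subsequent runs: cheap, deterministic replay of generated code,
        # no vision model in the loop.
        return ("replay", open(CACHE).read())
    # First run: the vision agent browses and explains each action.
    trace = vision_agent(url, task)
    script = generate_script(trace)  # turn the trace into runnable code
    with open(CACHE, "w") as f:
        f.write(script)
    return ("vision", script)
```

In a real system the cache would also need invalidation (replay the vision agent when the generated script starts failing, e.g. after a site redesign).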
If a website isn't behind Cloudflare or built as a JS-only app, it's generally better to skip Playwright. All the major AIs understand BeautifulSoup pretty well, and they're likely to write you a faster, less brittle scraper.
At scale, dropping the heavier dependencies and network traffic of a browser is meaningful.
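A minimal requests + BeautifulSoup sketch for a static (non-JS) site, as the lighter alternative to driving a full browser. The URL handling, User-Agent string, and CSS selector are placeholders you'd adapt to the target site:

```python
import requests
from bs4 import BeautifulSoup

def parse_titles(html):
    soup = BeautifulSoup(html, "html.parser")
    # Adjust the selector to the target site's actual markup.
    return [h.get_text(strip=True) for h in soup.select("h2.title")]

def scrape_titles(url):
    resp = requests.get(
        url,
        timeout=10,
        headers={"User-Agent": "my-scraper/0.1"},  # identify yourself
    )
    resp.raise_for_status()  # fail loudly on 4xx/5xx
    return parse_titles(resp.text)
```

Keeping parsing separate from fetching also makes the scraper testable offline, which is part of why these tend to be less brittle than browser automation.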
They aren't enough for anything that's login-protected or requires interaction (e.g. JS-driven wizards, downloading files, etc.)
That said, I'd try it again, but I don't want to spend money again.
Is AI capable of saying, "This website sucks, and doesn't work - file a complaint with the webmaster?"
I once had similar problems with the CIA's World Factbook. I shudder to think what an AI would do there.
Skyvern kept suggesting improvements unrelated to the issue they were testing for
The AI isn’t mad, and won’t refuse to renew. Unless it’s being run by the client of course.
Are clients using your platform to assess vendors?
While I can see _some_ good uses for it, there are clearly abusive uses as well, including in their own examples.
I mean, jesus fuck, who wants cheap/free automation out there for this: "Skyvern can be instructed to navigate to job application websites like Lever.co and automatically generate answers, fill out and submit the job application."?
I already have to deal with enough totally unsuitable scattergun job applications every time we advertise an open position.
This is just asking to be used for abuse.