LLMs seemed like the obvious fix — just throw the HTML at GPT and ask for JSON. Except in practice, it's more painful than that:
- Raw HTML is full of nav bars, footers, and tracking junk that eats your token budget. A typical product page is 80% noise.
- LLMs return malformed JSON more often than you'd expect, especially with nested arrays and complex schemas. One bad bracket and your pipeline crashes.
- Relative URLs, markdown-escaped links, tracking parameters — the "small" URL issues compound fast when you're processing thousands of pages.
- You end up writing the same boilerplate: HTML cleanup → markdown conversion → LLM call → JSON parsing → error recovery → schema validation. Over and over.
We got tired of rebuilding this stack for every project, so we extracted it into a library.
Lightfeed Extractor is a TypeScript library that handles the full pipeline from raw HTML to validated, structured data:
- Converts HTML to LLM-ready markdown with main content extraction (strips nav, headers, footers), optional image inclusion, and URL cleaning
- Works with any LangChain-compatible LLM (OpenAI, Gemini, Claude, Ollama, etc.)
- Uses Zod schemas for type-safe extraction with real validation
- Recovers partial data from malformed LLM output instead of failing entirely — if 19 out of 20 products parsed correctly, you get those 19
- Built-in browser automation via Playwright (local, serverless, or remote) with anti-bot patches
- Pairs with our browser agent (@lightfeed/browser-agent) for AI-driven page navigation before extraction
We use this ourselves in production at Lightfeed, and it's been solid enough that we decided to open-source it.
GitHub: https://github.com/lightfeed/extractor
npm: npm install @lightfeed/extractor
Apache 2.0 licensed.
Happy to answer questions or hear feedback.
One pattern that's helped us: decomposing complex schemas into multiple simpler sequential extractions rather than one large schema. Less impressive as a demo, but noticeably more reliable in production when you're cost-optimizing with smaller models. The partial recovery approach here (keeping valid items even when one fails) is exactly the right instinct for keeping pipelines alive.
This might be one reason why Claude Code uses XML for tool calling: repeating the tag name in the closing bracket helps it keep track of where it is during inference, so it is less error prone.
Also, a model could always put a proxy in front that turns your tool calls into XML and feeds JSON right back; you wouldn't even know any transformation took place.
On XML vs JSON, I think the goal here is to generate typed output, which is where JSON with Zod shines: the result type-checks and can later be inserted into typed database columns.
I've built agents both with tool calling and by parsing XML. Either way, you always need a self-correcting loop built in: if you're editing a file with an LLM, you have to feed back hints so the model gets it right on the second, third, or nth try. Just switching to XML won't get you that. I used to use XML; now I only use it for examples in the system prompt, for the model to learn from. That's all.
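For what it's worth, the loop itself is small. A sketch (the `callModel` stub stands in for a real LLM call and deliberately fails once):

```typescript
// Minimal self-correcting loop: call the model, try to parse, and on
// failure feed the error back as a hint for the next attempt.
let attempt = 0;
function callModel(prompt: string): string {
  attempt++;
  // First response has a trailing comma (invalid JSON), second is valid.
  return attempt === 1 ? '{"title": "Widget",}' : '{"title": "Widget"}';
}

function extractWithRetry(prompt: string, maxAttempts = 3): unknown {
  let hint = "";
  for (let i = 0; i < maxAttempts; i++) {
    const raw = callModel(prompt + hint);
    try {
      return JSON.parse(raw);
    } catch (e) {
      // The hint is what makes the loop self-correcting: the model
      // sees its own bad output plus the parser's complaint.
      hint = `\nYour previous output was invalid:\n${raw}\nError: ${e}\nReturn valid JSON only.`;
    }
  }
  throw new Error("extraction failed after retries");
}

const result = extractWithRetry("Extract the product as JSON.") as { title: string };
```

Parsing XML instead of JSON changes what you parse in the `try` block, not the need for the loop.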
I know multi-path LLM approaches exist, e.g. generating JSON patches.
We see this especially with arrays of objects where each object has optional nested fields. For complex nested objects, the model can get all items well formatted but one with an invalid field of wrong type. That's why we put effort into the repair/recovery/sanitization layer — validate field-by-field and keep what's valid rather than throwing everything out.
Then langchain and structured schemas for the output along w/ a specific system prompt for the LLM. Do you know which open source models work best or do you just use gemini in production?
Also, looking at the docs, Gemini 2.5 Flash is getting deprecated by June 17th https://ai.google.dev/gemini-api/docs/deprecations#gemini-2.... (I keep getting emails from Google about it), so you might want to update the examples to Gemini 3 Flash.
https://github.com/lightfeed/extractor/blob/main/src/convert...
I have used Gemma 3 and had good results.
Once Gemini 3 Flash drops the preview suffix, we'll update the examples. Thanks for the pointer.
Even Cloudflare's bot filter only blocks some of them.
I'm using honeypot URLs right now to block all crawlers that ignore rel="nofollow", but they appear to have many millions of devices. I wouldn't be surprised if there are a gazillion residential routers, webcams, and phones that have been hacked to function as simple doorways.
Things are really getting out of hand.
And it doesn't care about robots.txt.
Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
[0]: https://github.com/lightfeed/extractor/blob/d11060269e65459e...
I will add a PR to enforce robots.txt before the actual scraping.
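For anyone curious about the mechanics, a stdlib-only sketch of a naive Disallow check (a real crawler should use a full parser that handles wildcards, Allow rules, and agent groups per RFC 9309):

```typescript
// Naive robots.txt check: collect Disallow prefixes that apply to "*"
// (or a given agent) and test a path against them.
function isPathAllowed(robotsTxt: string, path: string, agent = "*"): boolean {
  const disallows: string[] = [];
  let applies = false;
  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim();        // strip comments
    const [key, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (/^user-agent$/i.test(key.trim())) {
      applies = value === "*" || value.toLowerCase() === agent.toLowerCase();
    } else if (/^disallow$/i.test(key.trim()) && applies && value) {
      disallows.push(value);
    }
  }
  return !disallows.some((prefix) => path.startsWith(prefix));
}

const robots = `
User-agent: *
Disallow: /private/
Disallow: /cart
`;
```

The check runs before each fetch; anything under a disallowed prefix is skipped.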
Yes. It is. You've just made an arbitrary choice not to define it as such.
It may or may not be, but if you want people to actually use this product I’d suggest improving your documentation and replies here to not look like raw Claude output.
I also doubt the premise about malformed JSON. I have never encountered anything like what you are describing with structured outputs.
price: z.number().optional() -> price: "n/a"
url: z.string().url().nullable() -> url: "not found"
It can also be a single invalid object in an array (e.g. a missing required field, or truncated output) causing the entire result to fail.
The unique contribution here is that we can recover invalid nullable or optional fields, and also remove invalid nested objects from an array.
Our main use case is retail price monitoring — comparing publicly listed product prices across e-commerce sites, which is pretty standard in the industry. But fair point, we should make that clearer in the README.
Those prices and that information are there for public viewers; one reason some people have a robots.txt, for example, is to reduce the traffic load that slop crawlers generate. Bandwidth is not free, so why would you assume you can ignore their robots.txt when you're not footing the bill?