Show HN: Convert HTML DOM to semantic markdown for use in LLMs
146 points | 1 month ago | 19 comments | github.com
mistercow
1 month ago
This is cool. When dealing with tables, you might want to explore departing from markdown. I’ve found that LLMs tend to struggle with tables that have large numbers of columns containing similar data types. Correlating a row is easy enough, because the data is all together, but connecting a cell back to its column becomes a counting task, which appears to be pretty rough.

A trick I’ve found that seems to work well is leaving some kind of id or coordinate marker on each column, and adding that to each cell. You could probably do that while still having valid markdown if you put the metadata in HTML comments, although it’s hard to say how well an LLM will understand that format.
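A minimal sketch of the column-marker idea above, assuming a hypothetical helper `toMarkedTable` and an arbitrary `c1`, `c2`, ... id scheme (neither comes from the thread or the library). Each cell repeats its column id in an HTML comment, so the table stays valid markdown:

```typescript
// Hypothetical column id scheme for this sketch: c1, c2, c3, ...
const columnId = (index: number): string => `c${index + 1}`;

// Prefix a cell with its column id inside an HTML comment.
const markCell = (index: number, text: string): string =>
  `<!-- ${columnId(index)} --> ${text}`;

// Render a header row and data rows as a markdown table in which every
// cell carries its column id, so no pipe counting is needed.
function toMarkedTable(headers: string[], rows: string[][]): string {
  const headerLine = `| ${headers.map((h, i) => markCell(i, h)).join(" | ")} |`;
  const dividerLine = `| ${headers.map(() => "---").join(" | ")} |`;
  const bodyLines = rows.map(
    (row) => `| ${row.map((cell, i) => markCell(i, cell)).join(" | ")} |`
  );
  return [headerLine, dividerLine, ...bodyLines].join("\n");
}

const markedTable: string = toMarkedTable(
  ["Q1", "Q2", "Q3"],
  [
    ["10", "12", "11"],
    ["9", "14", "13"],
  ]
);
console.log(markedTable);
```

Whether an LLM actually reads the comments reliably is untested here; the sketch only shows that the markers can be added mechanically.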

michaelmior
1 month ago
SpreadsheetLLM[0] might be worth looking into. It's designed for Excel (and similar) spreadsheets, so I'd imagine you could do something far simpler for the majority of HTML tables.

[0] https://arxiv.org/abs/2407.09025v1

msnkarthik
1 month ago
You're spot on about the challenges LLMs face with complex markdown tables, especially when column counts rise and data types are similar. The "counting task" for column correlation is a real pain point: it's like the LLM loses track of where it is in the data grid. Your ID/coordinate marker idea is clever! It provides explicit context that LLMs seem to crave. Using HTML comments for this metadata is an interesting approach. It keeps the markdown valid for human readability, but I share your uncertainty about how consistently LLMs would parse and utilize it.

Some other avenues worth exploring:

Alternative Formats: Have you experimented with formats like CSV or JSON for feeding tabular data to LLMs? They might offer a more structured representation that's easier to parse.

Pre-processing: Could we pre-process the table to create a more LLM-friendly representation? For example, converting it into a list of dictionaries, where each dictionary represents a row and keys represent column names.

Prompt Engineering: Perhaps there are specific prompts or instructions that can guide LLMs to better handle large tables within markdown.

It seems like there's room for innovation in how we bridge the gap between human-readable markdown tables and the structured data LLMs thrive on.
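The "list of dictionaries" pre-processing suggested above could look roughly like this. It is a sketch under assumptions: `tableToRecords` and the sample columns are made up for illustration, and the table is taken to be already parsed into header and row arrays.

```typescript
type TableRecord = Record<string, string>;

// Turn a header row plus data rows into one object per row, keyed by
// column name, so each value travels with its own label.
function tableToRecords(headers: string[], rows: string[][]): TableRecord[] {
  return rows.map((row) => {
    const rec: TableRecord = {};
    headers.forEach((header, i) => {
      // Short rows are padded with empty strings rather than undefined.
      rec[header] = row[i] ?? "";
    });
    return rec;
  });
}

const tableRecords: TableRecord[] = tableToRecords(
  ["sku", "price", "stock"],
  [
    ["A-1", "9.99", "3"],
    ["B-2", "4.50"],
  ]
);
console.log(JSON.stringify(tableRecords, null, 2));
```

The trade-off is token count: column names are repeated for every row, which is exactly the redundancy that removes the counting task.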
mattding
1 month ago
Do you have any numbers re: markdown performance, or is this anecdotal? I'm running a similar experiment right now and would love to hear anything else you've tried.
breck
1 month ago
ScrollSets work really well for using LLMs to generate tables: https://sets.scroll.pub/

ScrollSets are basically "deconstructed CSVs".

leroman
1 month ago
Thanks for sharing, will look into adding this as a flag in the options!
gmaster1440
1 month ago
> Semantic Clarity: Converts web content to a format more easily "understandable" for LLMs, enhancing their processing and reasoning capabilities.

Are there any data or benchmarks available that show what kind of text content LLMs understand best? Is it generally understood at this point that they "understand" markdown better than html?

mistercow
1 month ago
I haven’t found any specific research, but I suspect it’s actually the opposite, particularly for models like Claude, which seem to have been specifically trained on XML-like structures.

My hunch is that the fact that HTML has explicit matching closing tags makes it a bit easier for an LLM to understand structure, whereas markdown tends to lean heavily on line breaks. That works great when you’re viewing the text as a two dimensional field of pixels, but that’s not how LLMs see the world.

But I think the difference is fairly marginal, and my hunch should be taken with a grain of salt. From experience, all I can say is that I’ve seen stripped down HTML work fine, and I’ve seen markdown work fine. The one place where markdown clearly shines is that it tends to use fewer tokens.

leroman
1 month ago
Author here. It's a good point that benchmarks would help (I don't have any), but I think it's well understood that minimizing noise by reducing tokens improves the quality of the answer. And I think by now LLMs are well versed in Markdown, as it's the preferred markup language used when generating responses.
pseudosavant
1 month ago
My anecdotal experience is that Markdown usually does work better than HTML. I only leave it as HTML if the LLM needs to understand more about it than just the content, like the attributes on the elements (which would typically be a lot of noise, excess token input). I've found this to be especially true when using AI/LLMs in RAG scenarios.
mistercow
1 month ago
I wouldn’t be so sure on reducing tokens. Every token in context is space for the LLM to do more computation. Noise is obviously bad, because the computations will be irrelevant, but as long as your HTML is cleaned up, the extra tokens aren’t noise, but information about semantic structure.
leroman
1 month ago
Markdown, being a very minimal markup language, has no need for much of the structural and presentational stuff (CSS, structural HTML). HTML has many artifacts that are a huge bloat and add no semantic value, IMO. The goal here is to capture any markup with semantic value; if you have examples this library might miss, you are welcome to share them and I will look into it!
mistercow
1 month ago
Well, markdown and HTML are encoding the same information, but markdown is effectively compressing the semantic information. This works well for humans, because the renderer (whether markdown or plaintext) decompresses it for us. Two line breaks, for example, “decompress” from two characters to an entire line of empty space. To an LLM, though, it’s just a string of tokens.

So consider this extreme case: suppose we take a large chunk of plaintext and compress it with something like DEFLATE (but in a tokenizer friendly way), so that it uses 500 tokens instead of 2000 tokens. For the sake of argument, say we’ve done our best to train an LLM on these compressed samples.

Is that going to work well? After all, we’ve got the same information in a quarter as many tokens. I think the answer is pretty obviously “no”. Not only are we using a small fraction as much time and space to process the information, but the LLM will be forced to waste a lot of that computation on decompressing the data.

michaelmior
1 month ago
I think one big difference is that DEFLATE and most other standard compression algorithms are dictionary-based. So compressing in this way, you're really messing with the locality of tokens in a way that is likely unrelated to the semantics of what you're compressing.

For example, adding a repeated word somewhere in a completely different part of the document could change the dictionary and the entirety of the compressed text. That's not the case with the "compression" offered by converting HTML to Markdown. This compression more or less preserves locality and potentially removes information that is semantically meaningless (e.g. nested `div`s used for styling). Of course, this is really just conjecture on my part, but I think HTML > Markdown is likely to work well. It would certainly be interesting to have a good benchmark for this.

mistercow
1 month ago
Absolutely. I'm just making a more general point that "the same information in fewer tokens" does not mean "more comprehensible to an LLM". And we have more practical evidence that that's not the case, like the recent "Let's Think Dot by Dot" paper, which found that you can get many of the benefits of chain-of-thought simply by adding filler tokens to your context (if your model is trained to deal with filler tokens). For that matter, chain-of-thought itself is an example of increasing the tokens:information ratio, and generally improves LLM performance.

That's not to say that I think that converting to markdown is pointless or particularly harmful. Reducing tokens is useful for other reasons; it reduces cost, makes generation faster, and gives you more room in the context window to cram information into. And markdown is a nice choice because it's more comprehensible to humans, which is a win for debuggability.

I just don't think you can justifiably claim, without specific research to back it up, that markdown is more comprehensible to LLMs than HTML.

https://arxiv.org/abs/2404.15758

michaelmior
1 month ago
I think it's a reasonable claim. But I would agree that it's worthy of more detailed investigation.
sigmoid10
1 month ago
They understand best whatever was used during their training. For OpenAI's GPTs we don't really know since they don't disclose anything anymore, but there are good reasons to assume they used markdown or something closely related.
jddj
1 month ago
Just out of curiosity, what are some of those good reasons?

It's clear enough that they can use and consume markdown, but is the suggestion here that they've seen more markdown than xml?

I'd have guessed, possibly naively, that they fed in more straight HTML, but I'd be interested to know why that's unlikely to be the case.

sigmoid10
1 month ago
Well, for one, their chat markup language (i.e. what they used for chat/instruction tuning). But they closed the source on that last year, so we don't know what it looks like anymore. I doubt it changed much though. Also, when you work with their models a lot for e.g. document processing, you'll find that markdown tends to work better in the context than, say, html. I've heard similar observations from people at other companies.
DeveloperErrata
1 month ago
It's neat to see this getting attention. I've used similar techniques in production RAG systems that query over big collections of HTML docs. In our case the primary motivator was higher token efficiency (i.e., representing the same semantic content with a smaller token count).

I've found that LLMs are often bad at understanding "standard" markdown tables (of which there are many different types, see: https://pandoc.org/chunkedhtml-demo/8.9-tables.html). In our case, we found the best results when keeping the HTML tables in HTML, only converting the non-table parts of the HTML doc to markdown. We also strip the table tags of any attributes, with the exception of colspan and rowspan, which are semantically meaningful for more complicated HTML tables. I'd be curious if there are LLM performance differences between the approach the author uses here (seems like it's based on repeating column names for each cell?) and just preserving the original HTML table structure.
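The attribute stripping described above can be sketched as follows. This is an illustration only, not the commenter's production code: `stripTableAttributes` is a hypothetical name, and the regex approach assumes attribute values never contain `>`; a real pipeline would more likely walk a parsed DOM.

```typescript
// Attributes this sketch keeps; everything else on table-related tags is dropped.
const KEPT_ATTRIBUTES: string[] = ["colspan", "rowspan"];

function stripTableAttributes(html: string): string {
  // Opening tags of table-related elements, with their raw attribute text.
  const tagPattern =
    /<(table|thead|tbody|tfoot|tr|th|td|caption|colgroup|col)\b([^>]*)>/gi;
  // One attribute: name="value", name='value', name=value, or a bare name.
  const attrPattern = /([^\s"'=<>\/]+)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s"'>]+))?/g;

  return html.replace(
    tagPattern,
    (_match: string, tag: string, attrs: string): string => {
      const kept: string[] = [];
      // replace() is used here only to visit each attribute in order.
      attrs.replace(attrPattern, (whole: string, attrName: string): string => {
        if (KEPT_ATTRIBUTES.indexOf(attrName.toLowerCase()) !== -1) {
          kept.push(whole);
        }
        return "";
      });
      return kept.length > 0 ? `<${tag} ${kept.join(" ")}>` : `<${tag}>`;
    }
  );
}

const cleanedTable: string = stripTableAttributes(
  `<table class="grid" style="width:100%"><tr id="r1"><td colspan="2" class="x">Total</td><td rowspan='3' data-k=v>7</td></tr></table>`
);
console.log(cleanedTable);
```

Closing tags and non-table elements are left untouched, so the function can run over a whole document before the non-table parts are converted to markdown.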

richardreeze
1 month ago
This is really cool. I've already implemented it in one of my tools (I found it to work better than the Turndown/ Readability combination I was previously using).

One request: It would be great if you also had an option for showing the page's schema (which is contained inside the HTML).

leroman
1 month ago
Thanks for sharing!!

Would be really helpful if you opened an issue in Github with a specific example, happy to look into that!

la_fayette
1 month ago
The scoring approach for extracting the main content of web pages seems interesting. I am aware of the large body of research on that subject, going back decades, with sophisticated image- or NLP-based approaches. Since this extraction is critical to the quality of the LLM response, it would be good to know how well this performs. E.g., you could test it against a test dataset (https://github.com/scrapinghub/article-extraction-benchmark). Also, you could provide the option to plug in another extraction algorithm, since there are other implementations available... just some ideas for improvement...
leroman
1 month ago
This totally makes sense, I will look into adding support for additional ways to detect the main content, super interesting!
gradientDissent
1 month ago
Nice work. Main content extraction based on the <main> tag won’t work with most of the web pages these days. Arc90 could help.
leroman
1 month ago
Thank you! This is exactly why there's support for this specific use case: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... (see `findContentByScoring`)

And if you pass an optional flag `extractMainContent` it will use some heuristics to find the main content container if there is no such tag..
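To make "heuristics to find the main content container" concrete, here is a generic Readability-style sketch. It is not the library's `findContentByScoring`: the `PageNode` shape, the `label` field, and the scoring formula (paragraph text length discounted by link density) are all assumptions chosen for illustration.

```typescript
// Hypothetical, pre-measured view of a DOM subtree.
interface PageNode {
  label: string;          // identifier used only in this example
  tag: string;
  textLength: number;     // characters of text inside the element
  linkTextLength: number; // characters of that text inside <a> elements
  children: PageNode[];
}

// Score a container by the text of its direct <p> children, discounting
// paragraphs that consist mostly of links (navigation, footers).
function scoreNode(node: PageNode): number {
  let score = 0;
  for (const child of node.children) {
    if (child.tag !== "p") continue;
    const linkDensity =
      child.textLength === 0 ? 0 : child.linkTextLength / child.textLength;
    score += child.textLength * (1 - linkDensity);
  }
  return score;
}

// Walk the tree and return the highest-scoring container.
function findMainContent(root: PageNode): PageNode {
  let best: PageNode = root;
  let bestScore = scoreNode(root);
  const stack: PageNode[] = [...root.children];
  while (stack.length > 0) {
    const current = stack.pop();
    if (current === undefined) break;
    const currentScore = scoreNode(current);
    if (currentScore > bestScore) {
      best = current;
      bestScore = currentScore;
    }
    stack.push(...current.children);
  }
  return best;
}

const para = (textLength: number, linkTextLength: number): PageNode => ({
  label: "p", tag: "p", textLength, linkTextLength, children: [],
});
const samplePage: PageNode = {
  label: "body", tag: "body", textLength: 800, linkTextLength: 100,
  children: [
    { label: "nav", tag: "div", textLength: 40, linkTextLength: 40, children: [para(40, 40)] },
    { label: "article", tag: "div", textLength: 700, linkTextLength: 10, children: [para(400, 10), para(300, 0)] },
    { label: "footer", tag: "div", textLength: 60, linkTextLength: 50, children: [para(60, 50)] },
  ],
};
const mainContent: PageNode = findMainContent(samplePage);
console.log(mainContent.label);
```

In this sample the link-heavy `nav` and `footer` containers score near zero, so the `article` container wins even though nothing is marked up with a `<main>` tag.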

kartoolOz
1 month ago
WebArena does this really well, called the "accessibility_tree" https://github.com/web-arena-x/webarena/blob/main/browser_en...
nvartolomei
1 month ago
While I was writing a tool for myself to summarise daily the top N posts from HN, Google Trends, and RSS feed subscriptions I had the same problem.

The quick solution was to use beautiful soup and readability-lxml to try and get the main article contents and then send it to an LLM.

The results are ok when the markup is semantic. Often it is not. Then you have tables, images, weirdly positioned footnotes, etc.

I believe the best way to extract information the way it was intended to be presented is to screenshot the page and send it to a multimodal LLM for “interpretation”. Anyone experimented with that approach?

——

The aspirational goal for the tool is to be the Presidential Daily Brief but for everyone.

KolenCh
1 month ago
I am curious how it would compare to using pandoc with a readability algorithm, for example.
leroman
1 month ago
Bumped this together with the side-by-side comparison task.. so will look into it :)
alexliu518
1 month ago
Converting web pages to Markdown is a common requirement. I have found that turndown does a good job, but it cannot handle all dynamic web page content. As far as I know, processing dynamic web pages requires targeted adaptation, for example via Google extensions like Web2Markdown.
throwthrowuknow
1 month ago
Thank you! I’m always looking for new options to use for archiving and ingesting web pages and this looks great! Even better that it’s an npm package!
leroman
1 month ago
You might find this useful- just added code & instructions on how to make it a global CLI utility- https://github.com/romansky/dom-to-semantic-markdown/blob/ma...
jejeyyy77
1 month ago
hah, out of curiosity, what are you archiving and ingesting webpages for?
throwthrowuknow
1 month ago
Mostly for integration with my Obsidian vault so I don’t have to leave the app and can add notes and links and avoid linkrot.
zaSmilingIdiot
1 month ago
For personal use, and on the topic of Obsidian, I rolled my own form of this... It's quick and dirty, but generally works for my use case. I tend to push a page through turndown [0] to generate the markdown, then write this into Obsidian (also storing things like a copy of the rendered page, a link to the source, etc).

[0] https://github.com/mixmark-io/turndown

nbbaier
1 month ago
This is really cool! Any plans to add Deno support? This would be a great fit for environments like val.town[0], but they are based on a Deno runtime and I don't think this will work out of the box.

Also, when trying to run the node example from your readme, I had to use `new dom.window.DOMParser()` instead of `dom.window.DOMParser`

[0]: https://val.town

leroman
1 month ago
Afraid to say that other than bumping into a talk about Deno, I haven’t played around with it yet.. So thanks for the heads up, will look into it.

Thanks for the bug report!

nbbaier
1 month ago
Happy to also take a swing at it, but it would take me a bit because I've never added such compatibility to a library before.

Any specific guidelines for contributing? I see that you're open to contributions.

leroman
1 month ago
By all means, you can be the first contributor :) You are welcome to either open an issue so we can brainstorm possible approaches together, or send me a pull request with what you came up with and we can start there.
nbbaier
1 month ago
I'll probably jump in sometime in the next week!
KolenCh
1 month ago
Has anyone compared the performance of HTML input against other formats? I did an informal comparison, and from a few tests it seems HTML input is better. I thought markdown input would be more efficient too, but I’d like to see a more systematic comparison to find out whether that is the case.
brightvegetable
1 month ago
This is great, I was just in need of something like this. Thanks!
explosion-s
1 month ago
How is this different from any other HTML to markdown library, like Showdown or Turndown? Are there any specific features that make it better for LLMs specifically, instead of just converting HTML to MD?
leroman
1 month ago
Will add some side-by-side comparisons soon! The goal is not just to translate HTML to markdown 1:1 but to preserve any semantic information, which is generally not the goal for these tools. Some specific features and examples are in the README, like URL minification and optional main section detection and extraction (ignoring footer / header stuff).
Layvier
1 month ago
Nice, we have this exact use case for data extraction from scraped webpages. We've been using html-to-md; how does this compare to it?
Zetaphor
1 month ago
A browser demo would be a nice addition to this readme
leroman
1 month ago
Ah, I suppose you mean a web page one could visit to see a demo :) Added to the backlog!
DevX101
1 month ago
Problem is, with modern websites, everything is a div and you can't necessarily infer semantic meaning from the DOM elements.
leroman
1 month ago
After removing the noise you can distill the semantic stuff wherever possible, like metadata from images, buttons, etc., and see some structures emerge like footers and nav and body.. And many times, for the sake of SEO and accessibility, websites do adopt quite a bit of semantic HTML elements and annotations in the respective tags..
goatlover
1 month ago
What happened to using the semantic elements? Did that fall out of favor, or did the push for it get abandoned because popular frameworks just generate divs with (hopefully) semantic classes?
ianbicking
1 month ago
This is a great idea! There's an exceedingly large amount of junk in a typical HTML page that an LLM can't use in any useful way.

A few thoughts:

1. URL Refification[sic] would only save tokens if a link is referred to many times, right? Otherwise it seems best to keep locality of reference. Though to the degree that URLs are opaque to the LLM, I suppose they could be turned into references without any destination in the source at all, and if the LLM refers to a ref link you just look up the real link in the mapping.

2. Several of the suggestions here could be alternate serializations of the AST, but it's not clear to me how abstract the AST is (especially since it's labelled as htmlToMarkdownAST). And now that I look at the source it's kind of abstract but not entirely: https://github.com/romansky/dom-to-semantic-markdown/blob/ma... When writing code like this, I find that keeping the AST fairly abstract also helps with the implementation. (That said, you'll probably still be making something that is Markdown-ish because you'll be preserving only the data Markdown is able to represent.)

3. With a more formal AST you could replace the big switch in https://github.com/romansky/dom-to-semantic-markdown/blob/ma... with a class that can be subclassed to override how particular nodes are serialized.

4. But I can also imagine something where there's a node type like "markdown-literal" and to change the serialization someone could, say, go through and find all the type:"table" nodes and translate them into type:"markdown-literal" and then serialize the result.

5. A more advanced parsing might also turn things like headers into sections, and introduce more of a tree of nodes (I think the AST is flat currently?). I think it's likely that an LLM would follow `<header-name-slug>...</header-name-slug>` better than `# Header Name\n ....` (at least sometimes, as an option).

6. Even fancier would be running it with some full renderer (not sure what the options are these days) and starting to use getComputedStyle() and heuristics based on bounding boxes and stuff like that to infer even more structure.

7. Another use case that could be useful is to be able to "name" pieces of the document so the LLM can refer to them. The result doesn't have to be valid Markdown, really, just a unique identifier put in the right position. (In a sense this is what URL reification can do, but only for URLs?)
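Points 3 and 4 in the list above can be sketched side by side. Everything here is hypothetical: the `AstNode` union is a toy stand-in rather than the library's actual AST, and `MarkdownSerializer`, `KeyValueTableSerializer`, and `tablesToLiterals` are names invented for the example.

```typescript
// Toy AST for this sketch only; the real library's node types differ.
type AstNode =
  | { type: "heading"; level: number; text: string }
  | { type: "paragraph"; text: string }
  | { type: "table"; headers: string[]; rows: string[][] }
  | { type: "markdown-literal"; value: string };

class MarkdownSerializer {
  serialize(nodes: AstNode[]): string {
    return nodes.map((n) => this.node(n)).join("\n\n");
  }
  // Dispatch once; each node type has its own overridable method.
  protected node(n: AstNode): string {
    switch (n.type) {
      case "heading":
        return this.heading(n.level, n.text);
      case "paragraph":
        return n.text;
      case "table":
        return this.table(n.headers, n.rows);
      case "markdown-literal":
        return n.value; // emitted verbatim
    }
  }
  protected heading(level: number, text: string): string {
    return `${"#".repeat(level)} ${text}`;
  }
  protected table(headers: string[], rows: string[][]): string {
    const line = (cells: string[]): string => `| ${cells.join(" | ")} |`;
    return [line(headers), line(headers.map(() => "---")), ...rows.map(line)].join("\n");
  }
}

// Point 3: change how one node type is serialized by subclassing.
class KeyValueTableSerializer extends MarkdownSerializer {
  protected table(headers: string[], rows: string[][]): string {
    return rows
      .map((row) => row.map((cell, i) => `${headers[i] ?? ""}: ${cell}`).join("; "))
      .join("\n");
  }
}

// Point 4: rewrite table nodes into literal nodes before serializing.
function tablesToLiterals(
  nodes: AstNode[],
  render: (headers: string[], rows: string[][]) => string
): AstNode[] {
  return nodes.map((n): AstNode =>
    n.type === "table"
      ? { type: "markdown-literal", value: render(n.headers, n.rows) }
      : n
  );
}

const sampleAst: AstNode[] = [
  { type: "heading", level: 2, text: "Stock" },
  { type: "table", headers: ["sku", "qty"], rows: [["A-1", "3"], ["B-2", "0"]] },
];
const defaultOut: string = new MarkdownSerializer().serialize(sampleAst);
const subclassOut: string = new KeyValueTableSerializer().serialize(sampleAst);
const literalOut: string = new MarkdownSerializer().serialize(
  tablesToLiterals(sampleAst, (headers, rows) => JSON.stringify({ headers, rows }))
);
console.log([defaultOut, subclassOut, literalOut].join("\n---\n"));
```

The subclass route changes behavior for every table, while the literal-node route lets a caller pick individual nodes to rewrite and leaves the serializer itself untouched.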

leroman
1 month ago
This is some great feedback, thanks!

1. There are some crazy links with lots of arguments and tracking stuff in them, so they get very long. The refification turns them into a numbered "ref[n]" scheme, where you also get a map of ref[n]->url to do reverse translation.. it really saves a lot, in my experience. It's also optional, so you can be mindful about when you want to use this feature..

2. I tried to keep it domain specific (not to reinvent HTML...) so mostly Markdown components and some flexibility to add HTML elements (img, footer etc).

3. Not sure I'm sold with replacing the switch, it's very useful there because of the many fall through cases.. I find it maintainable but if you point me to some specific issue there it would help

4. There are some built-in functions to traverse and modify the AST. It is just JSON at the end of the day, so you could leverage the types and write your own logic to parse it; as long as it conforms to the format you can always serialize it, as you mentioned..

5. The AST is recursive so not flat.. sounds like you want to either write your own AST->Semantic-Markdown implementation or plug into the existing one, so I'll keep this in mind in the future

6. Sounds cool but out of scope at the moment :)

7. This feature would serve to help with scraping and kind of point the LLM to some element? Then the part I'm missing is how you would code this in advance.. There could be some meta-data tag you could add and it would be taken through the pipeline and added on the other side to the generated elements in some way..
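The refification described in point 1 above, with its ref[n]->url map for reverse translation, can be sketched as below. This is not the library's code: `refifyLinks`, `restoreLinks`, and the exact `ref0`, `ref1` token format are assumptions, and the regex only handles plain `[text](http...)` markdown links.

```typescript
interface RefifyResult {
  text: string;
  refs: Record<string, string>; // ref token -> original URL
}

// Replace each markdown link target with a short numbered token and
// collect the token -> URL map. Repeated URLs share one token.
function refifyLinks(markdown: string): RefifyResult {
  const refs: Record<string, string> = {};
  const seen = new Map<string, string>(); // URL -> ref token
  let counter = 0;
  const text = markdown.replace(
    /\]\((https?:\/\/[^)\s]+)\)/g,
    (_match: string, url: string): string => {
      let ref: string | undefined = seen.get(url);
      if (ref === undefined) {
        ref = `ref${counter}`;
        counter += 1;
        seen.set(url, ref);
        refs[ref] = url;
      }
      return `](${ref})`;
    }
  );
  return { text, refs };
}

// Reverse translation: put the real URLs back, e.g. in an LLM's answer.
function restoreLinks(text: string, refs: Record<string, string>): string {
  return text.replace(/\]\((ref\d+)\)/g, (match: string, ref: string): string => {
    const url: string | undefined = refs[ref];
    return url === undefined ? match : `](${url})`;
  });
}

const longLinks =
  "See [docs](https://example.com/docs?utm_source=a&utm_campaign=b) and " +
  "[docs again](https://example.com/docs?utm_source=a&utm_campaign=b), " +
  "or [home](https://example.com/).";
const refified: RefifyResult = refifyLinks(longLinks);
const restoredLinks: string = restoreLinks(refified.text, refified.refs);
console.log(refified.text);
console.log(JSON.stringify(refified.refs));
```

The savings grow with URL length rather than with repetition, which matches the point about tracking-heavy links; unknown tokens are left as-is on the way back.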
