Show HN: JustHTML – A pure Python HTML5 parser that just works

2 points

by EmilStenstrom

32 minutes ago

I got frustrated with HTML parsing in Python.

I wanted a Python HTML parser that was both correct and easy to install. The C-based ones (lxml, selectolax) are fast but not HTML5 compliant. The pure Python ones (html.parser, BeautifulSoup's default) are easy to install but choke on real-world HTML. html5lib is 80% correct but painfully slow.

So I wrote JustHTML. It's:

• 100% HTML5 compliant – passes all 8,500+ html5lib tests. If a browser can parse it, JustHTML can.

• Pure Python, zero dependencies – pip install and go. Works on PyPy, Pyodide, anywhere.

• Fast enough – ~0.1s to parse Wikipedia's homepage. Not C-fast, but 50% faster than html5lib.

• Simple API – doc.query("div.foo > p") with CSS selectors. One method to learn.

Example:

  from justhtml import JustHTML
  doc = JustHTML("<div><p class='intro'>Hello!</p></div>")
  print(doc.query(".intro")[0].to_html())
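If you want to check the "fast enough" claim against your own pages, here is a rough timing sketch. It is not the benchmark behind the numbers above. It uses the stdlib `html.parser` as a stand-in so it runs without installing anything, and `time_parse` and `stdlib_parse` are hypothetical helper names. Swap in any callable that takes an HTML string to compare parsers.

```python
import time
from html.parser import HTMLParser


def time_parse(parse, html, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls to parse(html)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        parse(html)
        best = min(best, time.perf_counter() - start)
    return best


def stdlib_parse(html):
    # Stand-in parser: the stdlib tokenizer, which does not build a tree.
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


# Synthetic input; replace with a saved copy of a real page for a fairer test.
sample = "<div><p class='intro'>Hello!</p></div>" * 1000
elapsed = time_parse(stdlib_parse, sample)
print(f"best of 5: {elapsed:.4f}s")
```

Taking the best of several runs reduces noise from other processes. Parsers that build a full tree do more work than this stand-in, so compare like with like.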

I've fuzz-tested it with 3 million malformed documents.
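For anyone curious what that kind of fuzzing looks like, below is a minimal sketch. It is an assumption about the general approach, not the actual harness used for JustHTML. The fragment list and the `random_document` and `fuzz` helpers are made up for illustration, and the stdlib `html.parser` stands in as the parser under test.

```python
import random
from html.parser import HTMLParser

# Hypothetical building blocks that tend to produce broken markup when glued together.
FRAGMENTS = ["<div>", "</div>", "<p", ">", "<!--", "-->", "<table>", "</b>",
             "<script>", "&amp", "&#x", "'", '"', "=", "text", "<![CDATA["]


def random_document(rng, max_parts=30):
    """Join a random number of fragments into a (usually malformed) document."""
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def fuzz(parse, iterations=1000, seed=0):
    """Call parse() on random documents and collect every input that raised."""
    rng = random.Random(seed)  # fixed seed so failures can be reproduced
    failures = []
    for _ in range(iterations):
        doc = random_document(rng)
        try:
            parse(doc)
        except Exception as exc:
            failures.append((doc, exc))
    return failures


def stdlib_parse(html):
    # Stand-in target; replace with the parser you want to test.
    parser = HTMLParser()
    parser.feed(html)
    parser.close()


failures = fuzz(stdlib_parse)
print(f"{len(failures)} crashing inputs found")
```

This only catches exceptions. A more thorough harness would also compare the resulting tree against a reference parser and watch for hangs.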

Would love feedback, especially on the API design.
