
Nokolexbor: Drop-in replacement for Nokogiri. 5.2x faster at parsing HTML

77 points

by ksec

6 days ago

6 comments | github.com

captn3m0

6 days ago

It seems to have an in-tree libxml 2.11 for XPath support, which was released in April 2023. Almost every second libxml release comes with a CVE, so I'm curious if there are plans to upgrade the libxml version, since it doesn't use the system libxml (same as nokogiri).

One of the reasons I still use nokogiri is because it puts a lot of effort into keeping libxml updated: https://github.com/sparklemotion/nokogiri/releases

alyandon

5 days ago

I once had to remediate security vulns in a gigantic C++ project and came across an ancient vendored version of libxml.

To my knowledge, the project didn't use XML for anything so I started digging into why they vendored it to begin with. Turns out, they vendored the entirety of libxml so it could parse the ~5 line config file for the project that was written in XML instead of literally anything else. The config file format was simple key/value pairs.

I hate working in this field sometimes. :-/
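For a sense of scale, a flat key/value config like the one described needs very little code to read. The following is a minimal Python sketch, not what the project in the story used; the `key = value` syntax, the `#` comment convention, the `parse_config` name, and the sample settings are all assumptions made for illustration.

```python
def parse_config(text):
    """Parse 'key = value' lines into a dict.

    Blank lines and lines starting with '#' are skipped.
    The format is hypothetical, chosen only for this sketch.
    """
    settings = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values may contain '='.
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {lineno}: expected 'key = value', got {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


# Hypothetical sample config, roughly the "~5 line" size from the story.
SAMPLE = """\
# service settings
host = example.internal
port = 8080
log_level = info
"""

config = parse_config(SAMPLE)
```

All values come back as strings; converting `port` to an integer, for example, is left to the caller.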

IgorPartola

5 days ago

This is why software development is more of a craft than it is science, at least when it comes to anything not directly to do with deep down algorithms and data structures (though the story of why GNU grep is so fast will be the exception even to that category).

alyandon

5 days ago

Yeah, you aren't kidding. This particular project was a work of modern abstract art.

Imagine a large C++ code base that was written by people who knew Java, Delphi and maybe a little C but apparently no C++. It was a mess of manual memory management errors, re-implemented Delphi classes like TString instead of using C++ equivalents, etc.

Why they decided to implement it in C++ instead of a language they knew I'll never understand. It took me about 2 months of continuous effort to replace all the C-isms, TStrings, XML config, etc with sane C++ equivalents. After I was done, it'd run for months with no issues whereas before you were lucky if it'd run for 2 days without it crashing.

Needless to say, I felt accomplished because I didn't get paged in the middle of the night from the thing falling over all the time.

andai

5 days ago

I've heard it called swatting a fly with a plasma TV.

ForceBru

6 days ago

Lexbor can also be used from Python: https://github.com/rushter/selectolax

gjtorikian

5 days ago

You may also be interested in https://github.com/gjtorikian/selma for high performance HTML manipulation. It’s built on Rust—Cloudflare’s lol_html parser to be precise.

schneems

6 days ago

From the readme it seems it’s faster because lexbor is faster. What makes lexbor so much faster?

jedisct1

5 days ago

Very cool! But no updates for 7+ months? Is it still actively being developed?

cxr

5 days ago

What do you expect to see getting worked on? Has something been added to the way HTML5 is supposed to be parsed in the last 7 months that this project doesn't handle but should be able to? Do you have a test case?

matthewmorgan

5 days ago

Now if they could just solve cache invalidation