FilterHN

amkharg26

7 minutes ago

Impressive performance gains! 5x faster than MuPDF is significant, especially for applications processing large volumes of PDFs. Zig's memory safety without garbage collection overhead makes it ideal for this kind of performance-critical work.

I'm curious about the trade-offs mentioned in the comments regarding Unicode handling. For document analysis pipelines (like extracting text from technical documentation or research papers), robust Unicode support is often critical.

Would be interesting to see benchmarks on different PDF types - academic papers with equations, scanned documents with OCR layers, and complex layouts with tables. Performance can vary wildly depending on the document structure.

polyaniline

31 seconds ago

What memory safety?

5 hours ago

  74910,74912c187768,187779
  < [Example 1: If you want to use the code conversion facetcodecvt_utf8to output tocouta UTF-8 multibyte sequence
  < corresponding to a wide string, but you don't want to alter the locale forcout, you can write something like:\237 D.27.21954
                                                                                                                                \251ISO/IECN4950wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
  < std::string mbstring = myconv.to_bytes\050L"Hello\134n"\051;
  ---
  >
  > [Example 1: If you want to use the code conversion facet codecvt_utf8 to output to cout a UTF-8 multibyte sequence
  > corresponding to a wide string, but you don’t want to alter the locale for cout, you can write something like:
  >
  > § D.27.2
  > 1954
  >
  > © ISO/IEC
  > N4950
  >
  > wstring_convert<std::codecvt_utf8<wchar_t>> myconv;
  > std::string mbstring = myconv.to_bytes(L"Hello\n");

It is indeed faster, but the output is messier. And it doesn't handle Unicode, in contrast to mutool, which does. (That probably also explains the big speed boost.)

TZubiri

5 hours ago

In my experience with parsing PDFs, speed has never been an issue; it has always been a matter of quality.

DetroitThrow

4 hours ago

I tried a small PDF and got a memory error. It's definitely much faster than MuPDF on that file.

5 hours ago

fixed.

[0]: https://repository.kallipos.gr/handle/11419/15087

4 hours ago

Yeah, sorry for the confusion. When I said Unicode, I meant foreign text rather than (just) the unescaped symbols, e.g. Greek. For one random Greek textbook[0], the zpdf output is (extract | head -15):

  01F9020101FC020401F9020301FB02070205020800030209020701FF01F90203020901F9012D020A0201020101FF01FB01FE0208 
  0200012E0219021802160218013202120222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C

  020301FF02000205020101FC020901F90003020001F9020701F9020E020802000205020A 
  01FC028C0213021B022002230221021800030200012E021902180216021201320221021A012E00030209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C 
 
  0200020D02030208020901F90203020901FF0203020502080003012B020001F9012B020001F901FA0205020A01FD01FE0208 
  020201300132012E012F021A012F0210021B013202200221012E0222 0209021D0212021D012E013202200222000301FA021A0220021C022002160213012E0222000F000301F90206012C

This goes on for the entire book. Mutool extracts the text just fine.
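
For context, output like this is typically what 2-byte glyph codes look like when they are printed as hex instead of being mapped through the font's ToUnicode CMap. Below is a minimal Python sketch of that mapping step, assuming a CMap that only uses `bfchar` entries (real ones also use `bfrange`). The two mappings in `SAMPLE_CMAP` are made up for illustration and are not taken from the textbook's fonts, and the function names are hypothetical, not zpdf's or MuPDF's API.

```python
import re

# Hypothetical ToUnicode CMap fragment (invented values, for illustration only).
SAMPLE_CMAP = """
2 beginbfchar
<01F9> <0391>
<0201> <039D>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {glyph code: text} from beginbfchar/endbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_hex_string(hex_codes, mapping, code_bytes=2):
    """Map each fixed-width glyph code to text; unmapped codes become U+FFFD."""
    raw = bytes.fromhex(hex_codes)
    out = []
    for i in range(0, len(raw), code_bytes):
        code = int.from_bytes(raw[i:i + code_bytes], "big")
        out.append(mapping.get(code, "\ufffd"))
    return "".join(out)

mapping = parse_bfchar(SAMPLE_CMAP)
decoded = decode_hex_string("01F9020101F9", mapping)
unmapped = decode_hex_string("0003", mapping)
```

With the invented table, `decoded` holds three Greek capitals, and a code that is missing from the table falls back to the replacement character rather than leaking raw hex into the output.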

3 hours ago

Sorry, I haven't yet figured out non-Latin text with ToUnicode references.

TZubiri

5 hours ago

Lol, but there are 100 competitors in the PDF text extraction space, and some are multi-million-dollar businesses: AWS Textract, ABBYY FineReader, PDFBox. I think you may be underestimating the challenge here.

8 hours ago

I built a PDF text extraction library in Zig that's significantly faster than MuPDF for text extraction workloads.

~41K pages/sec peak throughput.

Key choices: memory-mapped I/O, SIMD string search, parallel page extraction, streaming output. Handles CID fonts, incremental updates, all common compression filters.

~5,000 lines, no dependencies, compiles in <2s.

Why it's fast:

  - Memory-mapped file I/O (no read syscalls)
  - Zero-copy parsing where possible
  - SIMD-accelerated string search for finding PDF structures
  - Parallel extraction across pages using Zig's thread pool
  - Streaming output (no intermediate allocations for extracted text)

What it handles:

  - XRef tables and streams (PDF 1.5+)
  - Incremental PDF updates (/Prev chain)
  - FlateDecode, ASCII85, LZW, RunLength decompression
  - Font encodings: WinAnsi, MacRoman, ToUnicode CMap
  - CID fonts (Type0, Identity-H/V, UTF-16BE with surrogate pairs)
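
To make the memory-mapping point concrete, here is a minimal sketch in Python (a stand-in for illustration, not zpdf's Zig code). It maps a file read-only and searches backwards for the `startxref` keyword, which precedes the byte offset of the cross-reference section near the end of a PDF. The helper name `find_startxref` and the tiny fake file are assumptions made for the example, and Python's `mmap.rfind` is an ordinary byte search, not necessarily SIMD.

```python
import mmap
import os
import tempfile

def find_startxref(path):
    """Return the offset written after the last 'startxref' keyword, or None."""
    keyword = b"startxref"
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Search from the end: the trailer lives at the tail of the file.
        pos = mm.rfind(keyword)
        if pos == -1:
            return None
        # Copy only the few bytes after the keyword, not the whole file.
        tail = mm[pos + len(keyword):pos + len(keyword) + 32]
        fields = tail.split()
        if fields and fields[0].isdigit():
            return int(fields[0])
        return None

# Tiny fake file with a PDF-like tail (not a valid PDF, illustration only).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n% body omitted\nstartxref\n1234\n%%EOF\n")
    path = tmp.name

offset = find_startxref(path)
os.remove(path)
```

The file is never read into a buffer by the program itself; pages are faulted in by the OS as the search touches them, which is the trade-off debated further down the thread.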

DannyBee

1 hour ago

FWIW - mupdf is simply not fast. I've done lots of pdf indexing apps, and mupdf is by far the slowest and least able to open valid pdfs when it came to text extraction. It also takes tons of memory.

A better speed comparison would be either multi-process pdfium (since pdfium was forked from foxit before multi-thread support, you can't thread it), multi-threaded foxit, or something like syncfusion (which is quite fast and supports multiple threads). Or even single-thread pdfium vs single-thread your-code.

These were always the fastest/best options. I can (and do) achieve 41k pages/sec or better on these options.

The other thing you don't appear to mention is whether you handle putting the words in reading order (i.e. how they appear on the page), or only stream order (which varies in its relation to appearance order).

If it's only stream order, sure, that's really fast to do. But also not anywhere near as helpful as reading order, which is what other text-extraction engines do.

Looking at the code, it looks like the code to do reading order exists, but is not what is being benchmarked or used by default?

If so, this is really comparing apples and oranges.
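
To illustrate the difference between the two orders, here is a rough Python sketch under simplifying assumptions: each text span already has a page position, the page has a single column, and PDF coordinates are used (y grows upward). The `Span` type, `reading_order` function, and the sample coordinates are hypothetical and not taken from any of the engines mentioned.

```python
from typing import NamedTuple

class Span(NamedTuple):
    x: float
    y: float
    text: str

def reading_order(spans, line_tolerance=2.0):
    """Group spans into lines by baseline, top to bottom, then sort left to right."""
    lines = []
    for span in sorted(spans, key=lambda s: -s.y):  # higher y is nearer the top
        if lines and abs(lines[-1][0].y - span.y) <= line_tolerance:
            lines[-1].append(span)  # close enough to the current line's baseline
        else:
            lines.append([span])
    return [" ".join(s.text for s in sorted(line, key=lambda s: s.x)) for line in lines]

# Spans in the order they might appear in the content stream (made-up data).
stream_order = [
    Span(300, 700, "world"),
    Span(72, 680, "second"),
    Span(72, 700, "Hello"),
    Span(130, 680.5, "line"),
]

as_streamed = " ".join(s.text for s in stream_order)
as_read = reading_order(stream_order)
```

Emitting `as_streamed` requires no layout work at all, while `as_read` needs positions for every span plus a sort. Real engines also have to handle columns, rotation, and tables, which this sketch ignores.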

tveita

6 hours ago

What kind of performance are you seeing with/without SIMD enabled?

From https://github.com/Lulzx/zpdf/blob/main/src/main.zig it looks like the help text cites an unimplemented "-j" option to enable multiple threads.

There is a "--parallel" option, but that is only implemented for the "bench" command.

6 hours ago

I have now made it parallel by default and added an option to enable multiple threads.

I haven't tested without SIMD.

cheshire_cat

6 hours ago

You've released quite a few projects lately, very impressive.

Are you using LLMs for parts of the coding?

What's your workflow when approaching a new project like this?

6 hours ago

Claude Code.

littlestymaar

6 hours ago

> Are you using LLMs for parts of the coding?

I can't talk about the code, but the readme and commit messages are most likely LLM-generated.

And when you take into account that the first commit happened just three hours ago, it feels like the entire project has been vibe coded.

Neywiny

5 hours ago

Hard disagree. Initial commit was 6k LOC. Author could've spent years before committing. Ill-advised but not impossible.

littlestymaar

5 hours ago

Why would you make Claude write your commit message for a commit you've spent years working on though?

Neywiny

5 hours ago

1. Be not good at or a fan of git when coding

2. Be not good at or a fan of git when committing

Not sure what the disconnect is.

Now if it were vibecoded, I wouldn't be surprised. But benefit of the doubt

Jach

3 hours ago

We're well beyond benefit of the doubt these days. If it looks like a duck... For me there wasn't any doubt, the author's first top comment here was evidence enough, then seeing the readme + random code + random commit message, it's all obvious LLM-speak to me.

I don't particularly care, though, and I'm more positive about LLMs than negative even if I don't (yet?) use them very much. I think it's hilarious that a few people asked for Python bindings and then bam, done, and one person is like "..wha?" Yes, LLMs can do that sort of grunt work now! How cool, if kind of pointless. Couldn't the cycles have just been spent on trying to make MuPDF better? Though I see it's in C and AGPL, I suppose either is motivation enough to do a rewrite instead. (This is MIT licensed, though it's still unclear to me how 100% or even large-% vibe-coded code deserves any copyright protection; I think all such code should generally be under the Unlicense/public domain.)

If the intent of "benefit of the doubt" is to reduce people having a freak out over anyone who dares use these tools, I get that.

3 hours ago

I have updated the licence to WTFPL.

I'll try my best to make it a really good one!

jeffbee

6 hours ago

What's fast about mmap?

rishabhaiover

4 hours ago

It allows the program to reference memory without having to manage it in the heap space. It would make the program faster in a memory-managed language; otherwise it would reduce the memory footprint consumed by the program.

jeffbee

4 hours ago

You mean it converts an expression like `buf[i]` into a baroque sequence of CPU exception paths, potentially involving a trap back into the kernel.

rishabhaiover

4 hours ago

I don't fully understand the under-the-hood mechanics of mmap, but I can sense that you're trying to convey that mmap shouldn't be used as a blanket optimization technique, as there are tradeoffs in terms of page fault overheads (being at the mercy of OS page cache mechanics).

StilesCrisis

15 minutes ago

Tradeoffs such as "if an I/O error occurs, the program immediately segfaults." Also, I doubt you're I/O bound to the point where mmap is noticeably better than read, but I guess it's fine for an experiment.

jibal

2 hours ago

I think he's conveying that he doesn't know what he's talking about. buf[i] generates the same code regardless of whether mmap is being used. The first access to a page will cause a trap that loads the page into memory, but this is also true if the memory is read into.

jonstewart

5 hours ago

What’s the fidelity like compared to tika?

5 hours ago

The accuracy difference is marginal (1-2%) but the speed difference is massive.

mpeg

6 hours ago

very nice, it'd be good to see a feature comparison as when I use mupdf it's not really just about speed, but about the level of support of all kinds of obscure pdf features, and good level of accuracy of the built-in algorithms for things like handling two-column pages, identifying paragraphs, etc.

the licensing is a huge blocker for using mupdf in non-OSS tools, so it's very nice to see this is MIT

python bindings would be good too

6 hours ago

added a comparison, will improve further. https://github.com/Lulzx/zpdf?tab=readme-ov-file#comparison-...

also, added python bindings.

mpeg

5 hours ago

thanks, claude, I guess haha

as others have commented, I think while this is a nice portfolio piece, I would worry about its longevity as a vibe coded project

chanbam

5 hours ago

If he made something legitimately useful, who cares how?

odie5533

6 hours ago

Now we just need Python bindings so I can use it in my trash language of choice.

6 hours ago

added python bindings!

hiq

5 hours ago

Were you working on it already, or did it take you less than 17 minutes to commit https://github.com/Lulzx/zpdf/commit/9f5a7b70eb4b53672c0e4d8... ?

qeternity

2 hours ago

Claude Code.

agentifysh

6 hours ago

Excellent stuff. What makes Zig so fast?

observationist

6 hours ago

Not being slow - they compile straight to bytecode, they aren't interpreted, and have aggressive, opinionated optimizations baked in by default, so it's even faster than compiled c (under default conditions.)

Contrasted with python, which is interpreted, has a clunky runtime, minimal optimizations, and all sorts of choices that result in slow, redundant, and also slow, performance.

The price for performance is safety checks, redundancy, how badly wrong things can go, and so on.

A good compromise is luajit - you get some of the same aggressive optimizations, but in an interpreted language, with better-than-c performance but interpreted language convenience, access to low level things that can explode just as spectacularly as with zig or c, but also a beautiful language.

Zambyte

5 hours ago

Zig is safer than C under default conditions, not faster. By default it does a lot of illegal behavior safety checking, such as array and slice bounds checking, numeric overflow checking, and invalid union access checking. These features are disabled by certain (non-default) build modes, or explicitly disabled at a per-scope level.

It may be easier to write code that runs faster in Zig than in C under similar build optimization levels, because writing high-performance C code looks a lot like writing idiomatic Zig code. The Zig standard library offers a lot of structures like hash maps, SIMD primitives, and allocators with different performance characteristics to better fit a given use case. C application code often skips these things simply because they involve a lot more friction in C than in Zig.

jibal

2 hours ago

> they compile straight to bytecode

machine code, not https://en.wikipedia.org/wiki/Bytecode

> The price for performance is safety checks

In Zig, non-ReleaseFast build modes have significant safety checks.

> luajit ... with better-than-c performance

No.

agentifysh

6 hours ago

will add this to the list, now learning new languages is less of a barrier with LLMs

AndyKelley

6 hours ago

It makes your development workflow smooth enough that you have the time and energy to do stuff like all the bullet points listed in https://news.ycombinator.com/item?id=46437289

5 hours ago

>you have the time and energy to do stuff like all the bullet points listed

I don't disagree, but in this specific case, per the author, the project was made with Claude Code. Although it could just as well be that Zig is a better LLM target. I've noticed many new vibe-coded projects choose Zig as their target.

littlestymaar

6 hours ago

- First commit 3 hours ago.

- commit message: LLM-generated.

- README: LLM-generated.

I'm not convinced that projects vibe coded over the evening deserve the HN front page…

Edit: and of course the author's blog is also full of AI slop…

2026 hasn't even started and I already hate it.

ncgl

4 hours ago

Using AI isn't lazier than your regurgitated dismissal, to be fair.

dmytrish

5 hours ago

...and it does not work. I tried it on ~10 random pdfs, including very simple ones (e.g. a hello world from typst), it segfaults on every single one.

5 hours ago

Tried a few and it works. Maybe you have an older or newer Zig version than whatever the project targets. (Mine is 0.15.2.)

dmytrish

4 hours ago

   ~/c/t/s/zpdf (main)> zig version
   0.15.2

Sky is blue, water is wet, slop does not work.