Show HN: Context-aware Japanese furigana using Sudachi and ModernBERT
7 points | 2 hours ago | 2 comments | ezfurigana.com | HN
altilunium
26 minutes ago
It really works. Very cool. I’ve been looking for a service like this ever since I started learning Japanese, and I’ve rarely been satisfied with the ones available.
epitrochoid413
2 hours ago
I built a context-aware furigana converter for Japanese text, files, and web pages.

The main problem I wanted to solve is that simple dictionary-based furigana works well for common cases but breaks on words whose reading depends on context:

* 市場: いちば or しじょう

* 大分: おおいた or だいぶ

* 人気: にんき or ひとけ

* 最中: さいちゅう or さなか or もなか

* 方: かた or ほう

The engine is a hybrid system:

* Sudachi for tokenization, base forms, POS, and candidate readings

* Expanded dictionary coverage for compounds and fixed expressions

* Custom rules for counters, suffixes, rendaku patterns, and phrase overrides

* ModernBERT fallback for 144 especially context-dependent target words
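
To make the layering concrete, here is a minimal sketch of how stages like these could be ordered. It is not the actual engine: the `CANDIDATES` and `PHRASE_OVERRIDES` tables, the `resolve_reading` function, and the `classifier` callback are all hypothetical stand-ins (for tokenizer output, custom rules, and a model fallback respectively), and the real system may order or combine the stages differently.

```python
from typing import Callable, Optional

# Hypothetical candidate readings, standing in for what a tokenizer
# and dictionary would supply. The first entry is treated as the default.
CANDIDATES = {
    "市場": ["しじょう", "いちば"],
    "東京": ["とうきょう"],
}

# Hypothetical phrase overrides: (surface, preceding text) -> reading.
PHRASE_OVERRIDES = {
    ("市場", "株式"): "しじょう",  # 株式市場
    ("市場", "魚"): "いちば",      # 魚市場
}

# Hypothetical model interface: (surface, left context, candidates) -> reading.
Classifier = Callable[[str, str, list], str]


def resolve_reading(
    surface: str,
    left_context: str,
    classifier: Optional[Classifier] = None,
) -> Optional[str]:
    """Pick a reading by trying the cheapest stage first."""
    candidates = CANDIDATES.get(surface, [])
    if not candidates:
        return None  # unknown word: nothing to choose from
    if len(candidates) == 1:
        return candidates[0]  # unambiguous: the dictionary is enough

    # Stage 2: rule-based override keyed on the preceding text.
    for (word, preceding), reading in PHRASE_OVERRIDES.items():
        if word == surface and left_context.endswith(preceding):
            return reading

    # Stage 3: context model, used only for words that are still ambiguous.
    if classifier is not None:
        choice = classifier(surface, left_context, candidates)
        if choice in candidates:
            return choice

    # Last resort: the default (first) candidate.
    return candidates[0]
```

In this sketch, `resolve_reading("市場", "株式")` is settled by the override table and never reaches the classifier, which is the point of keeping the model as a fallback for a limited set of words.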

I have been testing it against an LLM-assisted benchmark of 7,500 Japanese lines. On the current benchmark, it gets about 12 wrong readings per 1,000 tokens. I treat that as a practical regression benchmark rather than a formal academic evaluation, but it has been useful for comparing versions and catching regressions.
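
For reference, 12 wrong readings per 1,000 tokens works out to a per-token error rate of about 1.2%. A metric of this kind can be computed roughly as in the sketch below, which assumes the predicted and expected readings are already aligned token by token; the function name `errors_per_thousand` is made up for illustration and the actual benchmark harness may count errors differently.

```python
def errors_per_thousand(predicted: list, expected: list) -> float:
    """Wrong readings per 1,000 tokens, for token-aligned reading lists."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must be aligned token by token")
    if not expected:
        return 0.0
    # Count positions where the predicted reading differs from the reference.
    wrong = sum(p != e for p, e in zip(predicted, expected))
    return 1000.0 * wrong / len(expected)
```

Running the same function on the output of two engine versions over a fixed benchmark gives a simple number to compare when checking for regressions.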

The hardest remaining cases are personal names, place names, rendaku, rare vocabulary, and domain-specific terms.

I would especially appreciate examples where it gets the reading wrong, since those are the most useful for improving the system.

fenomas
6 minutes ago
Nice work! I only gave it a quick pass, but it seems to work well.

(Also: I vouched for your comment; it was dead, FYI.)
