I built Defuddle while working on Obsidian Web Clipper[1] (also MIT-licensed) because Mozilla's Readability[2] appears to be mostly abandoned, and didn't work well for many sites.
It's still very much a work in progress, but I thought I'd share it today, in light of the announcement that Mozilla is shutting down Pocket. This library could be helpful to anyone building a read-it-later app.
Defuddle is also available as a CLI:
https://github.com/kepano/defuddle-cli
In the end I found that the Python trifatura library extracted the best-quality content with accurate metadata.
You might want to compare your implementation against trifatura to see if there is room for improvement.
If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.
For the curious: Trafilatura means "extrusion" in Italian.
| This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce. search maccheroni trafilati vs maccheroni lisci :)
(btw I think you meant trafilatura not trifatura)
Even if you are not an Obsidian user, the Markdown extraction quality is the most reliable I've seen.
Question: How did you validate this? You say it works better than Readability, but I don't see any tests or datasets in the repo to evaluate accuracy or coverage. Would it be possible to share those as well?
Defuddle works quite differently from Readability. Readability tends to be overly conservative and often removes useful content, because it tests blocks to find the beginning and end of the "main" content.
Defuddle can run multiple passes: if a pass returns no content, it expands its results and tries again. It also uses a greater variety of techniques to clean the content, for example using a page's mobile styles to detect content that can be hidden.
Lastly, Defuddle not only extracts the content but also standardizes the output, which Readability doesn't do. For example, footnotes and code blocks are all normalized to a single output format, whereas Readability keeps the original DOM intact.
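The mobile-styles trick can be sketched roughly like this (a hypothetical illustration of the idea, not Defuddle's actual code): scan stylesheets for max-width media queries, collect the selectors they hide, and treat matching elements as likely chrome rather than content.

```typescript
// Hypothetical sketch: find selectors that a stylesheet hides under mobile
// (max-width) media queries. Elements matching these selectors are good
// candidates for removal, on the theory that a site's own mobile view
// already marks them as non-essential.
function mobileHiddenSelectors(css: string): string[] {
  const selectors: string[] = [];
  // Find each "@media ... max-width ... {" opening.
  const open = /@media[^{]*max-width[^{]*\{/g;
  while (open.exec(css) !== null) {
    // Brace-count to find the matching close of this @media block.
    let depth = 1;
    let i = open.lastIndex;
    const start = i;
    while (i < css.length && depth > 0) {
      if (css[i] === '{') depth++;
      else if (css[i] === '}') depth--;
      i++;
    }
    const body = css.slice(start, i - 1);
    // Within the block, keep rules whose declarations hide the element.
    const ruleRe = /([^{}]+)\{([^}]*)\}/g;
    let rule: RegExpExecArray | null;
    while ((rule = ruleRe.exec(body)) !== null) {
      if (/display\s*:\s*none|visibility\s*:\s*hidden/.test(rule[2])) {
        selectors.push(rule[1].trim());
      }
    }
  }
  return selectors;
}
```

This is a simplification (real stylesheets nest at-rules and need a proper CSS parser), but it shows why mobile styles are a useful signal: sidebars, share widgets, and popups are often hidden on small screens while article text never is.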
I would love to give Defuddle a try as a Readability replacement. However, for my use case I want to do the extraction in a Chrome extension background script (service worker). I have not been able to get Defuddle to work there, while Readability does (when combined with linkedom). So basically, while this works:
import { parseHTML } from 'linkedom';
...
private extractArticleWithReadability(html: string) {
const { document } = parseHTML(html);
const reader = new Readability(document);
return reader.parse();
}
This does not:

import { parseHTML } from 'linkedom';
...
private async extractArticleWithDefuddle(html: string) {
const { document } = parseHTML(html);
const defuddle = new Defuddle(document);
return defuddle.parse();
}
I get errors like:
- Error in findExtractor: TypeError: Failed to construct 'URL': Invalid URL
- Defuddle: Error evaluating media queries: TypeError: undefined is not iterable (cannot read property Symbol(Symbol.iterator))
- Defuddle Error processing document: TypeError: b.getComputedStyle is not a function
Is there a way to run Defuddle in a Chrome extension background script/service worker? Or do you have any plans to add support for that?
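One workaround worth trying (an untested assumption on my part, not a documented Defuddle API): the three errors all point at browser globals that linkedom's document doesn't provide in a service worker, so stubbing them before handing the document to Defuddle might get past the crashes, at the cost of losing style-based cleanup.

```typescript
// Hypothetical shim (assumption, not official Defuddle or linkedom API):
// patch the document-like object from linkedom with minimal stand-ins for
// the browser globals the reported errors mention.
type AnyDoc = { defaultView?: any; baseURI?: string };

function shimForServiceWorker(document: AnyDoc, baseUrl: string): AnyDoc {
  const win = document.defaultView ?? (document.defaultView = {});
  // "Failed to construct 'URL': Invalid URL" suggests a missing base URI.
  if (!document.baseURI) document.baseURI = baseUrl;
  // Stub getComputedStyle so style checks become no-ops instead of throwing.
  if (typeof win.getComputedStyle !== 'function') {
    win.getComputedStyle = () => ({ getPropertyValue: () => '' });
  }
  // Stub matchMedia so media-query evaluation degrades gracefully.
  if (typeof win.matchMedia !== 'function') {
    win.matchMedia = () => ({
      matches: false,
      media: '',
      addListener: () => {},
      removeListener: () => {},
    });
  }
  return document;
}
```

Usage would then be something like `new Defuddle(shimForServiceWorker(document, url) as any).parse()`. Whether linkedom allows these properties to be assigned, and whether Defuddle's internals accept the stubs, is exactly the part I haven't verified.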
One such bug: take a language that uses commas as decimal separators instead of periods, like Dutch (I think), and a page with a lot of prices on it. It'll think all the numbers are relevant text.
And of course I tried to open a PR and get it merged, but they require tests, and of course the tests don't work on the page I'm testing. It's just very snafu, imho.
Clearly the comma thing is a bug; it's the lack of willingness to fix it that is a bit disheartening, and why I think it is a dead-ish repo.
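For illustration, a heuristic that sidesteps this class of bug might classify both decimal conventions as numeric tokens before counting commas as evidence of prose (a hypothetical sketch, not Readability's actual code):

```typescript
// Hypothetical sketch: recognize both "1,234.56" (English grouping) and
// "1.234,56" (Dutch/German grouping) as numbers, so a price-heavy page
// isn't mistaken for comma-rich sentence text.
// First alternative: grouped thousands plus optional 1-2 digit decimal part;
// second alternative: a plain number with optional decimal part.
const NUMBER_TOKEN =
  /^\d{1,3}(?:[.,]\d{3})*(?:[.,]\d{1,2})?$|^\d+(?:[.,]\d{1,2})?$/;

function isNumericToken(token: string): boolean {
  // Strip a leading currency symbol (assumed set, for illustration only).
  return NUMBER_TOKEN.test(token.replace(/^[€$£]/, ''));
}
```

A comma-counting "is this a sentence?" check could then skip tokens for which `isNumericToken` returns true, instead of treating every comma as punctuation.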
- Providing "reader mode" for your visitors
- Using it in a browser extension to add reader mode
- Scraping
- Plugging it into a [reverse] proxy that automatically removes unnecessary bloat from pages, e.g. for easier access on retro hardware <https://web.archive.org/web/20240621144514/https://humungus....> (archive.org link, because the website goes down regularly)
You just get a completely white page (on the iPhone reader). Usually it’s a news website.
Is this the website intentionally obscuring the content to ensure they can serve their ads? If so, how do they go about it?
On some websites, those are just modals that obscure the content, something that reader mode can usually deal with just fine, but on others, they're implemented as redirects or rendered server-side.
If reader mode doesn't work, dismiss those first and try again.
Thank you for picking up this work :-)
i started working on my own alternative but life (and web clipper) derailed the work.
it's funny. somehow slurp keeps gaining new users even though web clipper exists. so i might have to refactor it to use your library sometime soon even though I don't use slurp myself anymore.
And with Pocket going away I might have to add a save-it-later feature to it...
Unfortunately I tried a bunch of Hugging Face models that I could run on my MacBook, and all of them ignored my prompts despite my trying every variation I could think of. Half the time they just tried summarizing the page and describing what JavaScript was. :/
Not that I didn't already implement a read-it-later solution with Obsidian+Dataview, but this definitely makes things simpler!
Note that I'm using a preview (Catalyst) version; it will reach stable soon. I'm assuming kepano will submit it here then.