Crawling BitTorrent DHTs for Fun and Profit [pdf]
78 points
3 days ago
| 6 comments
| usenix.org
| HN
hdgr
8 hours ago
[-]
Bitmagnet (https://bitmagnet.io/) does exactly that. I left it running for a few weeks and then stopped the crawler. I didn't expect much, but I was still somewhat disappointed by the garbage it reeled in.
reply
NegativeLatency
7 hours ago
[-]
I've had one running for over a year now and it's completely replaced my usage of regular torrent sites. There is a lot of junk, and it gets stale, but it's still a better experience than most of the public trackers out there IMO.
reply
qingcharles
7 hours ago
[-]
Are you running it at home?

I built one with a nice TUI to run on a VPS so I can try and find rare magazine torrents, but Hetzner were upset about it. I need to find it a new home. It was a very good citizen, but it still raised too many flags.

reply
NegativeLatency
1 hour ago
[-]
Yes, but with a VPN to my seedbox. All the egress from the docker containers it's running in goes through this: https://github.com/passteque/gluetun

The seedbox is through https://www.feralhosting.com. I've used them for over 10 years now and they've been great (it's shared hosting, so I have linuxbrew set up there, but no docker sadly).

reply
fc417fc802
5 hours ago
[-]
I think generally you have to be very conservative about how you use ultra-cheap hosts like Hetzner, simply due to the economics. Either find a more expensive service that will exert more effort towards discretion, or spend $5 per month on a VPN that's friendly to torrents.
reply
farnsworthfusor
4 hours ago
[-]
I heard the more money you're paying them the more lenient they are.

For cheap hosts that allow funny stuff, look for ones that allow Tor exit nodes. Some allow it for ideological reasons. Look through the hundreds on LowEndTalk; on that forum you can even ask the providers directly whether they allow it.

reply
k4rli
7 hours ago
[-]
Runs fine at home. I've indexed 20M+ torrents in the last few months, running it during the day. With Prowlarr (or similar) it could easily replace other indexers.
reply
NoMoreNicksLeft
7 hours ago
[-]
Which magazines?
reply
qingcharles
6 hours ago
[-]
Anything I don't have! Sometimes I'll find a torrent with no seeds/peers and I'll wonder whether there is another torrent out there, containing the same files, that I can find.

The other day it was some older High Times issues that were torrented at one point, but the torrent is dead. Last night it was a mag titled Films & Filming, which I know has been scanned but which I can't find anywhere.

reply
toomuchtodo
6 hours ago
[-]
Can I get a copy for the Internet Archive? Will take as much of the corpus as you’re willing to provide.

(no affiliation with them)

reply
qingcharles
46 minutes ago
[-]
I'm starting to upload everything very soon. I have >4 million magazines here so far, I think. Feel free to email me (the address is on my profile) :)
reply
NoMoreNicksLeft
4 hours ago
[-]
High Times is mostly on archive.org, if you need that one. I'd sort of like the film-making one, I'll put some time into that. On my list of periodicals, I think the count's up to 500 that I consider important enough to archive and I'm nowhere near done with it.
reply
qingcharles
45 minutes ago
[-]
Yeah, I have the High Times from archive.org, but it's missing a lot of issues which were torrented at some point. If there's anything you need send me an email on my profile.
reply
gonzalohm
3 hours ago
[-]
How much space do you need to store the index?
reply
plusfour
1 hour ago
[-]
160 GB for 27M torrents
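
A quick back-of-the-envelope from those two figures, assuming the size is in decimal gigabytes (with binary GiB the result would be roughly 7% higher):

```python
# Figures quoted in the thread; the decimal-GB interpretation is an assumption.
index_bytes = 160 * 10**9
torrents = 27 * 10**6

bytes_per_torrent = index_bytes / torrents
print(f"{bytes_per_torrent / 1000:.1f} kB per torrent")
```

That works out to roughly 6 kB of index per torrent.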
reply
plusfour
5 hours ago
[-]
Same here, for over a year. How many torrents has yours indexed?
reply
NegativeLatency
1 hour ago
[-]
28M
reply
drdexebtjl
7 hours ago
[-]
I disabled mine because it was constantly writing to my SSD.
reply
felooboolooomba
7 hours ago
[-]
I solved it by storing the data on /dev/null
reply
Daviey
6 hours ago
[-]
The writes are insanely fast.
reply
permalac
6 hours ago
[-]
Pretty big space.
reply
muyuu
5 hours ago
[-]
probably worth adding some ML filter to it because yeah, most of the bulky stuff in bittorrent is always going to be garbage - a lot like the internet generally, the value is in filtering the good stuff out
reply
infinite_spin
2 hours ago
[-]
out of curiosity, what kind of junk/garbage is typical?
reply
yeeeloit
47 minutes ago
[-]
ungodly amounts of porn.
reply
qingcharles
6 hours ago
[-]
This is a good tip, thanks. I'll probably replace my home-grown scanner with this one.
reply
gritzko
7 hours ago
[-]
2010. I remember those times. I was doing these things for science in 2008. Performance-wise, PEX was much faster than DHT. At least, in my setting.

This year, I was giving it as an assignment to students. Does not take much time with LLMs.

reply
the8472
4 hours ago
[-]
Crawling has been somewhat simplified with BEP 51

https://bittorrent.org/beps/bep_0051.html
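
For anyone curious what that looks like on the wire, here is a minimal sketch of building a BEP 51 `sample_infohashes` query. The bencoder and the helper names (`bencode`, `sample_infohashes_query`, `split_samples`) are my own illustration, not taken from any particular client, and nothing here actually talks to the network.

```python
def bencode(value) -> bytes:
    """Encode ints, bytes, str, lists and dicts using bencoding."""
    if isinstance(value, bool):
        raise TypeError("bencoding has no boolean type")
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings and must be emitted in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def sample_infohashes_query(node_id: bytes, target: bytes,
                            txn_id: bytes = b"aa") -> bytes:
    """Build the KRPC query that asks a node for a sample of its infohashes."""
    if len(node_id) != 20 or len(target) != 20:
        raise ValueError("node IDs and targets are 20 bytes")
    return bencode({
        "t": txn_id,                      # transaction ID, echoed in the reply
        "y": "q",                         # message type: query
        "q": "sample_infohashes",         # method name defined by BEP 51
        "a": {"id": node_id, "target": target},
    })


def split_samples(samples: bytes) -> list[bytes]:
    """Split the reply's concatenated `samples` string into 20-byte infohashes."""
    return [samples[i:i + 20] for i in range(0, len(samples), 20)]


query = sample_infohashes_query(b"A" * 20, b"B" * 20)
print(query)
```

The resulting bytes would go out as a single UDP datagram to a DHT node. A node that supports the extension replies with a `samples` field (plus `interval`, `num` and `nodes`), so a crawler can sample stored infohashes directly rather than waiting to observe them in other nodes' traffic.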

reply
hackingonempty
8 hours ago
[-]
(2010)
reply
Boss0565
6 hours ago
[-]
old paper
reply
MoonWalk
6 hours ago
[-]
The article neglects to define "DHT" before using it.
reply
ivanjermakov
6 hours ago
[-]
Distributed hash table: a BitTorrent extension for discovering a torrent's seeders by advertising its hash across a pool of known peers. Think of it as a distributed tracker, in contrast to the traditional way of asking a known tracker for peers of that torrent.

Its algorithm is very elegant, using binary search on peers' and torrents' hashes, narrowing down to peers that are more likely to be seeders (or at least know some).

https://www.bittorrent.org/beps/bep_0005.html
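
A rough sketch of the distance metric behind that narrowing, assuming the XOR metric that BEP 5 describes. The function names are made up for illustration, and the node IDs are just hashes of small integers standing in for a routing table.

```python
import hashlib


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia-style distance: XOR of two 160-bit IDs, read as an integer."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def closest_nodes(target: bytes, node_ids: list[bytes], k: int = 8) -> list[bytes]:
    """Return the k known node IDs closest to the target (k=8 as in BEP 5)."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]


# Stand-in data: a fake infohash and 100 fake node IDs.
infohash = hashlib.sha1(b"example torrent").digest()
known_nodes = [hashlib.sha1(str(i).encode()).digest() for i in range(100)]

for node in closest_nodes(infohash, known_nodes, k=3):
    print(node.hex(), xor_distance(node, infohash).bit_length())
```

A real lookup repeats this step: ask the closest nodes you know about for the closest nodes they know about, and keep going until the results stop getting closer. Each extra shared prefix bit halves the remaining distance, which is where the binary-search feel comes from.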

reply
loeg
6 hours ago
[-]
Not a P2P innovation of BitTorrent, FWIW. The Kademlia DHT (used in eMule/LimeWire/Gnutella P2P networks) predates BitTorrent's DHT extension.
reply