Show HN: 4B+ DNS Records Dataset
88 points | 3 days ago | 13 comments | merklemap.com
Hi HN,

I've been building a pipeline to create a DNS records database. The goal is to enable research as well as competitive-landscape analysis of the internet.

The dataset for now spans around 4 billion records and covers all the common DNS record types:

    A
    AAAA 
    ANAME
    CAA
    CNAME
    HINFO
    HTTPS
    MX
    NAPTR
    NS
    PTR 
    SOA
    SRV
    SSHFP
    SVCB
    TLSA
    TXT
Each line in the CSV file represents a single DNS record in the following format: www.example.com,A,93.184.215.14
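For instance, lines in this format can be parsed with Python's csv module. A minimal sketch (the first sample row is from the format example above; the MX row is just an illustration of the format):

```python
import csv
import io

# Two rows in the dataset's format: name,record_type,value.
sample = (
    "www.example.com,A,93.184.215.14\n"
    "example.com,MX,10 mail.example.com.\n"
)

# csv.reader also copes with quoted values (e.g. TXT records containing commas).
records = [tuple(row) for row in csv.reader(io.StringIO(sample))]
for name, rtype, value in records:
    print(f"{name} ({rtype}) -> {value}")
```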

Let me know if you have any questions or feedback!

genmud
2 days ago
Neat! How is this different than domaintools/farsight [1]?

Passive DNS [2] has been in my toolbox for 15+ years and is invaluable for security research / threat intelligence. Knowing the historical resolutions of something is so helpful in investigations.

Anyone interested should check out the talk by one of the DomainTools people [3] on how it can be utilized for investigations.

Are you passively collecting this data, or actively querying for these records?

[1] - https://www.domaintools.com/products/threat-intelligence-fee...

[2] - https://www.circl.lu/services/passive-dns/

[3] - https://www.youtube.com/watch?v=oXmapqLkZd0

lyu07282
2 days ago
Is this making use of Let's Encrypt as well? AFAIK all Let's Encrypt-signed certificates, including all subdomains, are immediately public, which could be useful for security research as well.
Eikon
2 days ago
It's not about Let's Encrypt but Certificate Transparency, which works the same way for all public CAs.

I wrote a documentation piece here:

https://www.merklemap.com/documentation/how-it-works

whalesalad
2 days ago
At first glance it looks like this data is generated via the public certificate transparency log, so I would imagine the answer is yes.
Eikon
2 days ago
From what I understand, [1] is just TLDs, not subdomains?
genmud
2 days ago
That would be incorrect; they get subdomains in their passive DNS feeds.
Eikon
2 days ago
OK, it'd be interesting to know how big their datasets are compared to mine and how much they overlap.
blex
7 hours ago
Is there a good tool to browse big text archives, like .csv.xz, .csv.gz, or .7z, without decompressing them?

I don't want to decompress 29 GB into 211 GB each time I want to make a search.

Apart from grep / zgrep, is there a good tool/viewer (or a hex editor that can decompress parts of big files for display) for this general task?
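For xz specifically, one workaround (a sketch, not a full tool) is to stream-decompress in Python so nothing is written to disk; it's still a linear scan, just without the 211 GB temporary file:

```python
import lzma

def xz_grep(path, needle):
    """Yield matching lines from a .csv.xz, decompressing chunk by chunk in memory."""
    with lzma.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if needle in line:
                yield line.rstrip("\n")
```

gzip.open and bz2.open work the same way for .gz/.bz2 archives; true random access without a scan needs an index or a seekable format such as bgzip.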

romperstomper
1 day ago
There are quite a lot of duplicates, apparently only/mostly for CNAME records. Here are some from the beginning:

  staging.pannekoeken-poffertjes-restaurant-amstelland.nl,CNAME,www.pannekoeken-poffertjes-restaurant-amstelland.nl.
  staging.pannekoeken-poffertjes-restaurant-amstelland.nl,CNAME,www.pannekoeken-poffertjes-restaurant-amstelland.nl.
  www.domiciliatuempresa.com,CNAME,domiciliatuempresa.com.
  www.domiciliatuempresa.com,CNAME,domiciliatuempresa.com.
  *.autokozmetikakaposvar.hu,CNAME,autokozmetikakaposvar.hu.
  *.autokozmetikakaposvar.hu,CNAME,autokozmetikakaposvar.hu.
  c7ac691a.oob-nuq1907.indubitably.xyz,CNAME,oob-nuq1907.hosts.secretcdn.net.
  c7ac691a.oob-nuq1907.indubitably.xyz,CNAME,oob-nuq1907.hosts.secretcdn.net.
etc
Eikon
1 day ago
It's because I don't try to deduplicate and just save whatever response I get, which translates to this behavior for CNAMEs. Shouldn't be a big deal.

I may improve that in future releases.
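Until then, since the duplicates in the sample above sit on consecutive lines, a consumer can filter them in one pass. A sketch (use a set instead if duplicates can appear anywhere in the file):

```python
def drop_consecutive_duplicates(lines):
    # Collapse runs of identical lines, like `uniq` on sorted input.
    prev = object()  # sentinel that never equals a string
    for line in lines:
        if line != prev:
            yield line
        prev = line
```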

g-mork
2 days ago
Any possibility of adding (first seen, last seen) timestamps? There is basically no good way to reconstruct the state of, e.g., SPF at a point in time from existing DNS datasets.
Eikon
2 days ago
I could in future releases, yes.
ciclista
1 day ago
Would love the option of torrenting the file; the download seems quite slow, and hopefully it would save you some bandwidth!
Eikon
1 day ago
I was thinking about that; I'll experiment with adding a .torrent file :)
g48ywsJk6w48
2 days ago
Thank you for the dataset! The names are not always lowercase, so it has some duplicates.

Also, you can avoid unnecessary data by analyzing CNAME records: given "domain.tld CNAME www.domain.tld", you can keep only the domain.tld or www.domain.tld records.
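Folding case before comparing is enough on the name side, since DNS names are case-insensitive (RFC 4343). A sketch (the record-type set used for value folding is a judgment call, not part of the dataset's spec):

```python
# Record types whose value is itself a DNS name, so it can be case-folded too.
NAME_VALUED = {"CNAME", "NS", "PTR", "ANAME"}

def normalize(name, rtype, value):
    # DNS names are case-insensitive, so fold the name field;
    # fold the value only when it is a domain name rather than free text.
    if rtype in NAME_VALUED:
        value = value.lower()
    return (name.lower(), rtype, value)
```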

m3047
2 days ago
I've worked in the industry at IID and Farsight. I am skeptical of many claims made by IoC vendors.

You need timestamps, or first / last seen.

Records don't exist in a vacuum. They come in RRsets. They are served (sometimes inconsistently) by different nameservers. Some use cases care about this.

Records which don't resolve are also useful, especially for use cases which amount to front-running. On any given day if the wind was blowing the right direction .belkin could be one of the top 10 non-resolving TLDs. If your data is any good, check under .cisco for stuff which resolves to 127.0.53.53. ;-)

Information about provenance (where the data comes from) is required for some use cases.

We shipped Farsight's DNSDB on one or more 1TB drives, depending on what the customer was purchasing.

whalesalad
2 days ago
211GB seems very small. How is this generated?
Eikon
2 days ago
What makes you think it's small?
mobilio
2 days ago
Note that records can use geolocation-based routing.

This means that from country A I can get record X, but in country B the record can be Y.

It would be great if you could add a new column in the CSV showing whether a record has variations (Y/N).

romperstomper
1 day ago
How many domains are in this dataset?
35mm
2 days ago
How often is it updated?

Does it include expired domains?

Eikon
2 days ago
> How often is it updated?

I plan to do two releases a month for now; the goal is one a day.

> Does it include expired domains?

Yes.

mh-
2 days ago
This is fantastically valuable, especially if you can add the first/last-seen as requested by another commenter. Thanks for doing this.
Eikon
2 days ago
Thanks.

That's quite a fun project!

nhggfu
2 days ago
great work OP.
Eikon
2 days ago
Thank you!
T3RMINATED
2 days ago
Where do you get the data from? Does it include subdomains?
Eikon
2 days ago
Hi,

https://www.merklemap.com/documentation/how-it-works

Basically the same process as described there, but using that data to perform DNS queries.
