Robots.txt is a suicide note (2011)
78 points
5 hours ago
| 28 comments
| wiki.archiveteam.org
snowwrestler
4 hours ago
[-]
Copying my comment from a previous discussion of ignoring robots.txt, below. I actually don’t care if someone ignores my robots.txt, as long as their crawler is well run. But the smug attitude is annoying when so many crawlers are not.

————

We have a faceted search that creates billions of unique URLs by combinations of the facets. As such, we block all crawlers from it in robots.txt, which saves us AND them from a bunch of pointless indexing load. But a stealth bot has been crawling all these URLs for weeks. Thus wasting a shitload of our resources AND a shitload of their resources too. Whoever it is, they thought they were being so clever by ignoring our robots.txt. Instead they have been wasting money for weeks. Our block was there for a reason.
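For illustration, the kind of rule involved is tiny; a hedged sketch where the /search path and parameter name are placeholders, not the actual site's layout:

    User-agent: *
    # Keep crawlers out of the faceted search: the facet combinations generate
    # an effectively unbounded set of near-duplicate URLs.
    Disallow: /search
    Disallow: /*?facet=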

reply
nonethewiser
4 hours ago
[-]
Wasting your money too right?

I guess another angle on this is putting trust in people to comply with robots.txt. There is no guarantee, so we should probably design with the assumption that our sites will be crawled however people want.

Also I'm curious about your use case.

>We have a faceted search that creates billions of unique URLs by combinations of the facets.

Are we talking about a search that has filters like (to use ecommerce as an example) brand, price range, color, etc., and then all these combinations make up a URL (hence billions)? How does a crawler discover these? They are just designed to detect all these filters and try all combinations? That doesn't really jibe with my understanding of crawlers, but otherwise IDK how it would be generating billions of unique URLs. I guess maybe they could also be included in sitemaps, but I doubt that.

reply
xigoi
3 hours ago
[-]
> How does a crawler discover these? They are just designed to detect all these filters and try all combinations?

Presumably each of the facets is a link that adds it to the current query, so if you recursively follow links, you will end up with all combinations.
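A rough sketch of the combinatorics (illustrative TypeScript; the facets and values are invented): each facet is either unset or set to one of its values, so a naive recursive crawl can reach the product of (values + 1) across all facets.

    // Hypothetical facet definitions; real sites have far more facets and values.
    const facets: Record<string, string[]> = {
      brand: ["acme", "globex", "initech"],
      color: ["red", "green", "blue", "black"],
      price: ["0-50", "50-100", "100-500"],
    };

    // Each facet contributes (number of values + 1) choices: one per value, plus "unset".
    const reachable = Object.values(facets)
      .reduce((count, values) => count * (values.length + 1), 1);
    console.log(reachable); // 80 URLs for three small facets; dozens of facets reach billions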

reply
chao-
3 hours ago
[-]
I have experienced the same situation with facet-like pages in the past. Links leading to (for example) the same product with a different color pre-selected on page load. Or, within a listing of a category's products, a link that changes the sort from price descending to price ascending. All else equal, even the crawlers don't want to re-index the same page in eighty different ways if they can avoid it. They simply don't know better, and have decided to ignore our attempt to teach them (robots.txt).

In the past, we've used this behavior as a signal to identify and block bad bots. These days, they will try again from 2000 separate residential IPs before they give up. But there was a long time where egregious duplicate page views of these faceted pages (against the advice of robots.txt) made detecting certain bad bots much easier.

reply
rhet0rica
2 hours ago
[-]
I have two related stories.

Googlebot has been playing a multiple-choice flash card game on my site for months—the page picks a random question and gives you five options to choose from. Each URL contains all of the state of the last click: the option you chose, the correct answer, and the five buttons. Naturally, Google wants to crawl all the buttons, meaning the search tree has a branch factor of five and a search space of about 5000^7 possible pages. Adding a robots.txt entry failed to fix this—now the page checks the user agent and tells Googlebot specifically to fuck off with a 403. Weeks later, I'm still seeing occasional hits. Worst of all, it's pretty heavy-duty—the flash cards are for learning words, and the page generator sometimes sprinkles in items that look similar to the correct answer (i.e., they have a low edit distance).
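The check amounts to something like the following sketch (Express-style middleware with a made-up path; not the actual code):

    import express from "express";

    const app = express();

    // Refuse known crawlers before rendering the flash-card page at all.
    app.use("/flashcards", (req, res, next) => {
      const ua = req.get("user-agent") ?? "";
      if (/googlebot/i.test(ua)) {
        res.status(403).send("This game is disallowed in robots.txt; please stop crawling it.");
        return;
      }
      next();
    });

    app.listen(8080);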

On the other hand there was a... thing crawling a search page on a separate site, but doing so in the most ass-brained way possible. Different IP addresses, all with fake user agents from real clients fetching search results for a database retrieval form with default options. (You really expect me to believe that someone on Symbian is fetching only page 6000 of all blog posts for the lowest user ID in the database?) The worst part about this one is that the URLs frequently had mangled query strings, like someone had tried to use substring functions to swap out the page number and gotten it wrong 30 times, resulting in Markov-like gibberish. The only way to get this foul customer to go away was to automatically ban any IP that used the search form incorrectly. So far I have banned 111,153 unique addresses.

robots.txt wasn't adequate to stop this madness, but I can't say I miss Ahrefs or DotBot trying to gather valuable SEO information about my constructed languages.

reply
paulddraper
4 hours ago
[-]
Yes, this has been the traditional reason for robots.txt -- it protects the bot as much as it does the site.
reply
freedomben
4 hours ago
[-]
I don't know anything about your specific use case, so take this with a grain of salt, but I've experienced this as well, and when I dug in it was usually vulnerability scanning.
reply
bonaldi
5 hours ago
[-]
Not sure the emotive language is warranted. Message appears to be “if you use robots.txt AND archive sites honor it AND you are dumb enough to delete your data without a backup THEN you won’t have a way to recover and you’ll be sorry”.

It also presumes that dealing with automated traffic is a solved problem, which, with the volumes of LLM scraping going on, is simply not true for more hobbyist setups.

reply
QuercusMax
4 hours ago
[-]
I just plain don't understand what they mean by "suicide note" in this case, and it doesn't seem to be explained in the text.

A better analogy would be "Robots.txt is a note saying your backdoor might be unlocked".

reply
chao-
3 hours ago
[-]
I also cannot figure out from context what part of this is "suicide".

I don't even think it's a note saying your back door is unlocked? As others and I shared in a sibling comment thread, we have worked at places that implemented robots.txt in order to prevent bots from getting into nearly-infinite tarpits of links that lead to nearly-identical pages.

reply
stickfigure
3 hours ago
[-]
The meaning is reasonably clear to me: Robots.txt says "Don't archive this data. When the website dies, all the information dies with it." It's a kind of death pact.
reply
QuercusMax
3 hours ago
[-]
That's not a suicide note, though, in any way I understand it.
reply
stickfigure
3 hours ago
[-]
It's the inevitable suicide of the data.

Language gets weird when you anthropomorphize abstract things like "data", but I thought it was clever enough. YMMV.

reply
QuercusMax
2 hours ago
[-]
The suicide of the data listed in robots.txt? How? The whole point of the article is they ignore what you have written in your robots.txt, so they'll archive it regardless of what you say.
reply
stickfigure
2 hours ago
[-]
Correct, they are challenging your written wish for data-suicide.
reply
bigbuppo
4 hours ago
[-]
Or major web properties for that matter.
reply
paulddraper
4 hours ago
[-]
> volumes of LLM scraping

FWIW I have not seen a reputable report on the % of web scraping in the past 3 years.

(Wikipedia being a notable exception...but I would guess Wikipedia to see a far larger increase than anything else.)

reply
esseph
4 hours ago
[-]
It's hard because of attribution, but it absolutely is happening at very high volume. I actually got an alert from our monitoring tools this morning, when I woke up, that some external sites were being scraped. Happens multiple times a day.

A lot of it is coming through compromised residential endpoint botnets.

reply
tolmasky
3 hours ago
[-]
Wikipedia says their traffic increased roughly 50% [1] from AI bots, which is a lot, sure, but nowhere near the amount where you'd have to rearchitect your site or something. And this checks out: if it were actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase from combined search and AI bots in the last year (May 2024 to May 2025) has been just... 18% [2].

The way you hear people talk about it though, you'd think that servers are now receiving DDOS-levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which, if you think about it, makes sense... It's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be such a large number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are resulting in huge traffic spikes [2]. Again, the total combined increase is 18%.

The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they are being blocked by their inability to keep up with the latest ECMAScript spec. You are just using an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days.

The entire thing seems to be built on the misconception that the "common" way to build a scraper is doing something curl-esque. This idea is entirely based on the Google scraper, which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone that rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs. If I were to write a quick and dirty scraper today, I would trivially make it through Anubis' protections... by doing literally nothing and without even realizing Anubis exists. Just using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.
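To be concrete, "standard scraping practices with Puppeteer" means roughly this (a minimal sketch, not any particular scraper's code):

    import puppeteer from "puppeteer";

    // Load a page the way off-the-shelf scrapers do: a real browser with a real JS
    // engine and cookies on by default, waiting for the network to go quiet so that
    // client-side rendering (and any in-page challenge) has a chance to finish.
    async function fetchRendered(url: string): Promise<string> {
      const browser = await puppeteer.launch({ headless: true });
      try {
        const page = await browser.newPage();
        await page.goto(url, { waitUntil: "networkidle2", timeout: 60_000 });
        return await page.content();
      } finally {
        await browser.close();
      }
    }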

I'm investigating further, but I think this entire thing may have started due to some confusion, but want to see if I can actually confirm this before speculating further.

1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)

2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...

3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...

4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...

reply
zzo38computer
2 hours ago
[-]
> It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs.

I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScript to display the text being searched. (This is not the same as excluding everything that has JavaScript; some web pages use JavaScript but can still display the text even without it.)

> Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.

These are some of the legitimate problems with Anubis (and this is not the only way that you can be blocked by Anubis). Cloudflare can have similar problems, although it works a bit differently, so it is not exactly the same.

reply
tolmasky
2 hours ago
[-]
> I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScript to display the text being searched. (This is not the same as excluding everything that has JavaScript; some web pages use JavaScript but can still display the text even without it.)

Sure... but off-topic, right? AI companies are desperate for high quality data, and unlike search scrapers, are actually not supremely time sensitive. That is to say, they don't benefit from picking up on changes seconds after they are published. They essentially take a "snapshot" and then do a training run. There is no "real-time updating" of an AI model. So they have all the time in the world to wait for a page to reach an ideal state, as well as all the incentive in the world to wait for that too. Since the data effectively gets "baked into the model" and then is static for the entire lifetime of the model, you over-index on getting the data, not on getting it fast, or cheap, or whatever.

reply
xena
3 hours ago
[-]
Hi, main author of Anubis here. How am I meant to store state like "user passed a check" without cookies? Please advise.
reply
tolmasky
2 hours ago
[-]
If the rest of my post is accurate, that's not the actual concern, right? Since I'm not sure if the check itself is meaningful. From what is described in the documentation [1], I think the practical effect of this system is to block users running old mobile browsers or running browsers like Opera Mini in third world countries where data usage is still prohibitively expensive. Again, the off-the-shelf scraping tools [2] will be unaffected by any of this, since they're all built on top of Puppeteer, and additionally are designed to deal with the modern SPA web which is (depressingly) more or less isomorphic to a "proof-of-work".

If you are open to jumping on a call in the next week or two I'd love to discuss directly. Without going into a ton of detail, I originally started looking into this because the group I'm working with is exploring potentially funding a free CDN service for open source projects. Then this AI scraper stuff started popping up, and all of a sudden it looked like if these reports were true it might make such a project no longer economically realistic. So we started trying to collect data and concretely nail down what we'd be dealing with and what this "post-AI" traffic looks like.

As such, I think we're 100% aligned on our goals. I'm just trying to understand what's going on here since none of the second-order effects you'd expect from this sort of phenomenon seem to be present, and none of the places where we actually have direct data seem to show this taking place (and again, Cloudflare's data seems to also agree with this). But unless you already own a CDN, it's very hard to get a good sense of what's going on globally. So I am totally willing to believe this is happening, and am very incentivized to help if so.

EDIT: My email is my HN username at gmail.com if you want to schedule something.

1. https://anubis.techaro.lol/docs/design/how-anubis-works

2. https://apify.com/apify/puppeteer-scraper

reply
rafram
2 hours ago
[-]
Cloudflare Turnstile doesn't require cookies. It stores per-request "user passed a check" state using a query parameter. So disabling cookies will just cause you to get a challenge on every request, which is annoying but ultimately fair IMO.
reply
jddj
3 hours ago
[-]
Doesn't Wikipedia offer full tarballs?

This would imaginably put some downward pressure on scraper volume.

reply
tolmasky
2 hours ago
[-]
From the first paragraph in my comment:

> You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

Yes, they do. But they aren't in a rush to tell AI companies this, because again, this is not actually a super meaningful amount of traffic increase for them.

reply
imtringued
2 hours ago
[-]
I don't think you understand the purpose of Anubis. If you did then you'd realize that running a web browser with JS enabled doesn't bypass anything.
reply
tolmasky
2 hours ago
[-]
By bypass I mean "successfully pass the challenge". Yes, I also have to sit through the Anubis interstitial pages, so I promise I know it's not being "bypassed". (I'll update the post to avoid future confusion.)

Do you disagree that a trivial usage of an off-the-shelf Puppeteer scraper [2] has no problem doing the proof-of-work? As I mentioned in this comment [1], AI scrapers are not on some time crunch; they are happy to wait a second or two for the final content to load (there are plenty of normal pages that take longer than the Anubis proof of work does to complete), and are also unfazed by redirects. Again, these are issues you deal with in normal everyday scraping. And also, do you disagree with the traffic statistics from Cloudflare's site? If we're seeing anything close to that 18% increase, then it would not seem to merit user-visible levels of mitigation. Even if it were 180% you wouldn't need to do this. nginx is not constantly on the verge of failing from a double-digit "traffic spike".

As I mentioned in my response to the Anubis author here [3], I don't want this to be misinterpreted as a "defense of AI scrapers" or something. Our goals are aligned. The response there goes into detail that my motivation is that a project I am working on will potentially not be possible if I am wrong and this AI scraper phenomenon is as described. I have every incentive in the world to just want to get to the bottom of this. Perhaps you're right, and I still don't understand the purpose of Anubis. I want to! Because currently neither the numbers nor the mitigations seem to line up.

BTW, my same request extends to you, if you have direct experience with this issue, I'd love to jump on a call to wrap my head around this.

My email is my HN username at gmail.com if you want to reach out, I'd greatly appreciate it!

1. https://news.ycombinator.com/item?id=44944761

2. https://apify.com/apify/puppeteer-scraper

3. https://news.ycombinator.com/item?id=44944886

reply
tracerbulletx
5 hours ago
[-]
This is a screed that does not address a single point of the actual philosophical issue.

The issue is a debate over what the expectations are for content posted on the public internet. There is the viewpoint that it should be totally machine-operable and programmatic, and that if you want it to be private you should gate it behind authentication; that the semantic web is an important concept and violating it is a breach of protocol. There's also the argument that it's your content, no one has a right to it, and you should be able to license its use any way you want. There is a trade-off between the implications of the two.

reply
rafram
5 hours ago
[-]
I think this is kind of misguided - it ignores the main reason sites use robots.txt, which is to exclude irrelevant/old/non-human-readable pages that nevertheless need to remain online from being indexed by search engines - but it's an interesting look at Archive Team's rationale.
reply
xp84
5 hours ago
[-]
Yes, and I'd add to that dynamically generated URLs of infinite variability, which have two separate but equally important reasons for automated traffic to avoid them:

1. You (bot) are wasting your bandwidth, CPU, storage on a literally unbounded set of pages

2. This may or may not cause resource problems for the owner of the site (e.g. Suppose they use Algolia to power search and you search for 10,000,000 different search terms... and Algolia charges them by volume of searches.)

The author of this angry rant really seems specifically ticked at some perceived 'bad actor' who is using robots.txt as an attempt to "block people from getting at stuff," but it's super misguided in that it ignores an entire purpose of robots.txt that is not even necessarily adversarial to the "robot."

This whole thing could have been a single sentence: "Robots.txt has a few competing vague interpretations and is voluntary; not all bots obey it, so if you're fully relying on it to prevent a site from being archived, that won't work."

reply
paulddraper
4 hours ago
[-]
Correct.

That has been one of the biggest uses -- improve SEO by preventing web crawlers from getting lost/confused in a maze of irrelevant content.

reply
hosh
5 hours ago
[-]
I absolutely will use a robots.txt on my personal sites, which will include a tarpit.

This has nothing to do with keeping my webserver from crashing, and has more to do with crawlers using content to train AI.

Anything I actually want to keep as a legacy, I’ll store with permanent.org

reply
Bender
4 hours ago
[-]
Any time I think about robots.txt, I think about a quote from Pirates of the Caribbean [1]: "The only rules that really matter are these: what a man can do and what a man can't do," except that I replace "man" with "bot". Everything should be designed to handle pirates, given the hostile nature of the internet.

To me, robots.txt is a friendly way to say, "Hey bots, this is what I allow. Stay in these lanes, including crawl-delay [2], and I won't block you." Step outside and I can put you on an exercise wheel. I know very few bots support crawl-delay, since it is not part of the standard, but that is not my problem. Blocking bots, making them waste a lot of cycles, feeding them dummy data, wildly reordering packets, adding random packet loss, or slowing them to 2KB/s is more fun for me than playing Doom.
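Spelled out, the "lanes" look something like this (paths illustrative; as noted, crawl-delay is non-standard and many bots ignore it):

    User-agent: *
    Crawl-delay: 10
    Disallow: /search
    Disallow: /admin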

[1] - https://www.youtube.com/watch?v=B4zwh26kP8o [video][2 mins]

[2] - https://en.wikipedia.org/wiki/Robots.txt#Crawl-delay_directi...

reply
knome
5 hours ago
[-]
given this is from a group determined to copy and archive your data with or without your permission, their opinions on the usefulness of ROBOTS.TXT seem kind of irrelevant. of course they aren't going to respect it. they see themselves as 'rogue digital archivists', and being edgy and legally rather grey is part of their self-image. they're going to back it up, regardless of who says they can't.

for the rest of the net, ROBOTS.TXT is still often used for limiting the blast radius of search engines and bot crawl-delays and other "we know you're going to download this, please respect these provisions" type situations, as a sort of gentlemen's agreement. the site operator won't blackhole your net-ranges if you abide their terms. that's a reasonably useful thing to have.

reply
SCdF
5 hours ago
[-]
This wiki page was created in 2011, in case you're wondering how long they've held this position
reply
procaryote
5 hours ago
[-]
Not having things archived because you explicitly opted out of crawling is a feature, not a bug

Otherwise you can whitelist a specific crawler in robots.txt

reply
rzzzt
4 hours ago
[-]
(I understand it is a different entity) archive.org at one point started to honor the robots.txt settings of the website's current owner, hiding archived copies you could previously browse. I don't know whether they still do this.
reply
jawns
5 hours ago
[-]
Is a person not allowed to put up a "no trespassing" sign on their land unless they have a reason that makes sense to would-be trespassers?

I know that ignoring a robots.txt file doesn't carry the same legal consequences as trespassing on physical land, but it's still going against the expressed wishes of the site owner.

Sure, you can argue that the site owner should restrict access using other gates, just as you might argue a land owner should put up a fence.

But isn't this a weird version of Chesterton's Fence, where a person decides that they will trespass beyond the fenced area because they can see no reason why the area should be fenced?

reply
stickfigure
3 hours ago
[-]
End User License Agreement. To view the contents of this website, you must have purchased one or more 12oz or larger Diet Sprite(TM) products within the last 24 hours. Acknowledge that violations may be referred to local law enforcement for criminal prosecution.

    Accept [ ]       Exit Site [ ]     
Share and enjoy.
reply
rglover
5 hours ago
[-]
I see old stuff like this and it starts to become clear why the web is in tatters today. It may not be respected, but unless you have a really silly config (I'm hard-pressed to even guess what you could do short of a weird redirect loop), it won't be doing any harm.

> What this situation does, in fact, is cause many more problems than it solves - catastrophic failures on a website are ensured total destruction with the addition of ROBOTS.TXT.

Of course an archival pedant [1] will tell you it's a bad idea (because it makes their archival process less effective)—but this is one of those "maybe you should think for yourself and not just implement what some rando says on the internet" moments.

If you're using version control, running backups, and not treating your production env like a home computer (i.e., you're aware of the ephemeral nature of a disk on a VPS), you're fine.

[1] Archivists are great (and should be supported), but when you turn it into a crusade, you get foolish, generalized takes like this wiki.

reply
hyperpape
4 hours ago
[-]
Regarding silly configurations: https://danluu.com/googlebot-monopoly/.
reply
bigstrat2003
4 hours ago
[-]
I really lost a lot of respect for the team when I read this page. No matter how good their intentions are, by deliberately ignoring robots.txt they are behaving just as badly as the various AI companies (and other similar entities) that scrape data against the wishes of the site owner. They are, in other words, directly contributing to the destruction of the commons by abusing trust and ensuring that everyone has to treat each other as a potential bad actor. Dick move, Archive Team.
reply
akk0
4 hours ago
[-]
Mind, you're reading a 14-year-old page. I honestly don't see any value in this being posted on HN.
reply
zzo38computer
3 hours ago
[-]
I don't think that is quite right. It can be useful to exclude some kinds of dynamic files, files with redundant pieces in URLs (e.g. a query string for files that do not require it; if you could reasonably do it (which in many cases you unfortunately can't), you might make it never crawl URLs with a query string), to set crawl delays, etc. (You might also want to use other ways of mirroring some files; e.g. for a version control repository, you do not need to crawl the web pages and can just clone the repository instead. This way, you will only need to fetch the new and changed files and not all of them.)
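Where the wildcard extension is supported (it is not part of the original standard, and some crawlers ignore it), the query-string exclusion looks like this:

    User-agent: *
    # Non-standard wildcard syntax; honored by some major crawlers, ignored by others.
    Disallow: /*?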

Robots.txt should not be used for preventing automated access in general or for disallowing mirrors to be made.

Someone else wrote "I actually don't care if someone ignores my robots.txt, as long as their crawler is well run." I mostly agree with this, although whoever wrote the crawler does not know everything (but neither does the server operator).

In writing the specification for the crawling policy file for Scorpion protocol, I had tried to make some things more clear and avoid some problems, although it is not perfect.

reply
dang
5 hours ago
[-]
Related:

ROBOTS.TXT is a suicide note - https://news.ycombinator.com/item?id=13376870 - Jan 2017 (30 comments)

Robots.txt is a suicide note - https://news.ycombinator.com/item?id=2531219 - May 2011 (91 comments)

reply
madamelic
5 hours ago
[-]
robots.txt is the digital equivalent of a "one piece per person" sign on an unwatched Halloween bowl.

The people who wouldn't take more don't need the sign, and the people who want to will do it anyway.

If you don't want crawling, there are other ways to prevent / slow down crawling than asking nicely.

reply
blipvert
4 hours ago
[-]
Alternatively, it’s the equivalent of having a sign saying “Caution, Tarpit” and having a tarpit.

You’re welcome to ride if you obey the rules of carriage.

Don’t make me tap the sign.

reply
notatoad
4 hours ago
[-]
>Alternatively, it’s the equivalent of having a sign saying “Caution, Tarpit”.

yeah, the fact that it can be used for blocking crawlers is kind of misleading. it's called "robots.txt", it's there to help the robots, not to block them. you use it to help a robot crawl your site more efficiently, and tell them what not to bother looking at so they don't waste their time.

people seem to have forgotten really quickly that making your website as accessible as possible to crawlers was actually considered a good thing, and there was a whole industry around optimizing websites for search engine crawlers.

reply
kazinator
5 hours ago
[-]
If you don't obey someone's robots.txt, your bot will end up in their honeypot: be prepared for zip bombs, generated infinite recursions, and whatnot. You'd better have good counter-countermeasures.

robots.txt is helping you identify which parts of the website the author believes are of interest for search indexing or AI training or whatever.

fetching robots.txt and behaving in a conforming manner can open doors for you. If I spot a bot like that in my logs, I might whitelist them, and feed them a different robots.txt.

reply
paulddraper
4 hours ago
[-]
tbf most bots do that nowadays.
reply
btilly
4 hours ago
[-]
Whatever we think of Archive Team's position, modern AI companies have clearly taken the same basic position. And they are willing to devote a lot more resources to vacuuming up the internet than crawlers did back in 2011.

See https://news.ycombinator.com/item?id=43476337 for a random example of a discussion about this.

My personal position is that robots.txt is useless when faced with companies who have no sense of shame about abusing the resources of others. And if it is useless, there isn't much of a point in having it. Just make sure that nothing public facing is going to be too expensive for your server. But that's like saying that the solution to thieves is to not carry money around. Yes, it is a reasonable precaution. But it doesn't leave me feeling any better about the thieves.

reply
xg15
5 hours ago
[-]
Set up a tarpit, put it in the robots.txt as a global exclusion, watch hilarity ensue for all the crawlers that ignore the exclusion.
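i.e., something along these lines, with the trap path being whatever your tarpit lives under:

    User-agent: *
    # Well-behaved crawlers never see this; anything ignoring the exclusion walks
    # straight into the endless generated pages under /trap/.
    Disallow: /trap/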
reply
Permik
2 hours ago
[-]
The real question is: why haven't we moved robots.txt to DNS TXT records? This way the consent to scrape would be crystal clear even before they connect to the server.
reply
gmuslera
4 hours ago
[-]
robots.txt assumes well-meaning major players that are respectful of the site's intentions: players that try to index or mirror sites while avoiding overwhelming them and accessing only what is supposed to be freely accessible. Using a visible user agent, having a clearly defined IP block for those scans, and following a predictable method of scanning all go in the same direction of cooperating with the site owner: getting visibility while not affecting functionality (much).

But that doesn't mean there aren't bad players that ignore robots.txt, give random user-agent strings, or connect from IPs all over the world to avoid being blocked.

LLMs have changed the landscape a bit, mostly because far more players want to get everything, or have automated tools that search your information for specific requests. But that doesn't rule out that well-behaved players still exist.

reply
spaceport
4 hours ago
[-]
Renaming my robots.txt to reeeebots.txt and writing a justification line by line on why XYZ shouldn't be archived is now on my todo. Along with adding a tarpit.
reply
rolph
5 hours ago
[-]
the archiveteam statements in the article are sure to win special attention, i think this could be footgunning, and .IF archiveteam .THEN script.exe pleasantries.
reply
Sanzig
5 hours ago
[-]
Ugh. Yeah, this misses the point: not everyone wants their content archived. Of course, there are no feasible technical means to prevent this from happening, so robots.txt is a friendly way of saying "hey, don't save this stuff." Just because there's no technical reason you can't archive doesn't mean that you shouldn't respect someone's wishes.

It's a bit like going to a clothing optional beach with a big camera and taking a bunch of photos. Is what you're doing legal? In most countries, yes. Are you an asshole for doing it? Also yes.

reply
giancarlostoro
5 hours ago
[-]
It's mostly for search engines to figure out how to crawl your website. Use it sparingly.
reply
layer8
5 hours ago
[-]
(2011)
reply
rafram
5 hours ago
[-]
Thanks, added to title.
reply
soiltype
5 hours ago
[-]
I have more complaints about this shitty article than it's worth listing. At least it's clearly a human screed, not LLM-generated.

Just say you won't honor it and move on.

reply
karaterobot
3 hours ago
[-]
> Precisely one reason comes to mind to have ROBOTS.TXT, and it is, incidentally, stupid - to prevent robots from triggering processes on the website that should not be run automatically

Counter-point: I have a blog I don't want to appear on search engines because it has private stuff on it. 25 years ago I added two lines to its robots.txt file, and I've never seen it show up on any search engine since.
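For reference, the canonical two-line blanket exclusion:

    User-agent: *
    Disallow: /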

I'm not pretending nobody has indexed my blog and kept a copy of the results. I'm just saying the blog I started in college doesn't show up when you search for my name on Google, which is all I care about.

reply
mnw21cam
2 hours ago
[-]
Admittedly, the whole stupidity of the article aside, they do make a valid point with their "triggering processes" bit. GET should not perform side-effects. That's what POST is for.

Also, for your blog, have you considered password-protecting it? As in, http passwords, which are still (surprisingly) a thing. You can even have a password-free landing page handing the password out to every human that visits, with an onward link to the password-protected site. That should stop the bots but keep letting the humans in.

reply
zzo38computer
2 hours ago
[-]
> GET should not perform side-effects. That's what POST is for.

I agree, although not all triggered processes are things that cause changes; some might be calculations that are unnecessary for crawling, and other stuff like that.

> Also, for your blog, have you considered password-protecting it? As in, http passwords, which are still (surprisingly) a thing. You can even have a password-free landing page handing the password out to every human that visits, with an onward link to the password-protected site. That should stop the bots but keep letting the humans in.

I think it is a reasonable idea, although you can also add text into the password prompt that you can use to figure out the password (as the "realm" text). This also means that you will not need to use cookies.
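A minimal sketch of that in nginx (the realm text and paths are placeholders):

    location /blog/ {
        # The realm string is what the browser shows in its password prompt, so it
        # can carry the hint (or the password itself) for human visitors.
        auth_basic "Hint: the password is the name of my cat";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }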

reply
josefritzishere
3 hours ago
[-]
It's a shame that most companies that ignore robots.txt also operate horribly behaved crawlers. I feel like they forfeit the moral high ground to render judgement.
reply
_Algernon_
5 hours ago
[-]
I mean the main reason is that robots.txt is pointless these days.

When it was introduced, the web was largely collaborative project within the academic realm. A system based on the honor system worked for the most part.

These days the web is adversarial through and through. A robots.txt file seems like an anachronistic, almost quaint museum piece, reminding us of what once was, while we plunge headfirst into tech feudalism.

reply
RajT88
5 hours ago
[-]
In fact, the problem of the "never-ending September" has evolved into "the never-ending barrage of Septemberbots and AI vacuum bots".

The horrors of the 1990s internet are quaint by comparison to the society-level problems we now have.

reply
rolph
5 hours ago
[-]
it's not a request anymore; it's often a warning not to go any farther, lest ye be zipbombed or tarpitted into wasting bandwidth and time.
reply