I have `REQUIRE_SIGNIN_VIEW=true` and I see nothing but my own traffic on Gitea's logs.
Is it because I'm using a subdomain that doesn't imply there's a Gitea instance behind it?
> Enable this to force users to log in to view any page or to use API. It could be set to "expensive" to block anonymous users accessing some pages which consume a lot of resources, for example: block anonymous AI crawlers from accessing repo code pages. The "expensive" mode is experimental and subject to change.
Forgejo doesn't seem to have copied that feature yet
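For anyone looking for the knob: in Gitea it lives in app.ini under [service]. A minimal sketch (the "expensive" value is the experimental mode described in the quoted docs):

```ini
; app.ini (sketch) - require sign-in before viewing any page or using the API
[service]
REQUIRE_SIGNIN_VIEW = true
; or, per the quoted docs, only block anonymous access to expensive pages:
; REQUIRE_SIGNIN_VIEW = expensive
```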
That’s the kind of result that ensures we’ll be seeing anime girls all over the web in the near future.
In the article, quite a few of the listed traffic sources would simply be unable to access the server at all if the author could get away with a geoblock.
Every country has (at the very least) a few bad actors; it's a small handful of countries that actively protect their bad actors from any sort of accountability or identification.
But the numbers don't lie. In my case, I locked the server down to a fairly small group of European countries and bot scans dropped from about 1500 per day to 0.
The tradeoff is just too big to ignore.
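For anyone wanting to try the same, a rough sketch of a country allowlist check using the geoip2 library and a MaxMind GeoLite2 database (the allowlist, the database path and the TEST-NET address are placeholders; in practice this usually sits in the firewall with country IP sets rather than in application code):

```python
# Sketch: allow only a small set of countries, drop everything else.
# Assumes `pip install geoip2` and a downloaded GeoLite2-Country.mmdb.
import geoip2.database
import geoip2.errors

ALLOWED = {"DE", "FR", "NL", "AT", "CH"}  # example allowlist, adjust to taste

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

def is_allowed(ip: str) -> bool:
    try:
        country = reader.country(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return False  # unknown IPs get blocked too; flip this if too aggressive
    return country in ALLOWED

if __name__ == "__main__":
    print(is_allowed("203.0.113.7"))  # TEST-NET address, likely not in the DB
```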
It's funny observing their tactics, though. On the whole, spammers have moved from the bare domain to various prefixes like @outreach.domain, @msg.domain, @chat.domain, @mail.domain, @contact.domain and, most recently, @email.domain.
It's also interesting watching the common parts before the @. Most recently I've seen a lot of marketing@, before that chat@, and about a month after I blocked that, chat1@. I mostly block *@domain though, so I'm less aware of these trends.
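For the curious, blocking *@domain plus every prefix subdomain is a tiny check; a sketch, with placeholder domains:

```python
# Sketch: treat mail from a blocked domain or any of its subdomains as spam.
BLOCKED_DOMAINS = {"example-spammer.com", "another-offender.net"}  # placeholders

def is_blocked(sender: str) -> bool:
    domain = sender.rsplit("@", 1)[-1].lower()
    return any(
        domain == blocked or domain.endswith("." + blocked)
        for blocked in BLOCKED_DOMAINS
    )

# Catches the bare domain as well as outreach.*, msg.*, chat.*, mail.*, etc.
assert is_blocked("marketing@outreach.example-spammer.com")
assert is_blocked("chat1@example-spammer.com")
assert not is_blocked("friend@example.org")
```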
Or I might try and put up Anubis only for them.
I got accidentally locked out of my server when I connected over Starlink, which IP-maps to the US even though I was physically in Greece.
As practical advice, I would use a blocklist for commerce websites, and an allowlist for infra/personal.
I'm not saying don't block, just saying be aware of the unintended blocks and weigh them.
This will be in part people on home connections tinkering with LLMs at home, blindly running some scraper instead of (or as well as) using the common pre-scraped data-sets and their own data. A chunk of it will be from people who have been compromised (perhaps by installing or updating a browser add-on or “free” VPN client that has become, or always was, nefarious), whose home connections are being farmed out by VPN providers selling “domestic IP” services to the people running scrapers.
Much more likely are those companies that pay people (or trick people) into running proxies on their home networks to help with giant scraping projects that want to rotate through thousands of "real" IPs.
I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN.
No client compromise required; it's network abuse that gives you a good reputation if you use mobile data.
But yes, selling botnets made of compromised devices is also a thing.
The cost of the open, artisanal web has shot up due to greed and incompetence; the crawlers are poorly written.
Anyway, test some scrapers and bots here [1] and let me know if they get through. A successful response will show "Can your bot see this? If so you win 10 bot points." and a figlet banner. Read-only SFTP login is "mirror" and no pw.
[Edit] - I should add that I require bots to tell me they speak English, optionally in addition to other languages, but not a couple that are blocked. E.g. en,de-DE,de is good, de-DE,de will fail, just because. Not suggesting anyone do this.
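In case that rule reads ambiguously, roughly what the check looks like as code (a sketch of the idea, not what I actually run; the blocked list is a placeholder):

```python
# Sketch: require English somewhere in Accept-Language, and reject a couple
# of specifically blocked tags. "en,de-DE,de" passes, "de-DE,de" fails.
BLOCKED_TAGS = {"zz"}  # placeholder for the couple of tags actually blocked

def accept_language_ok(header: str) -> bool:
    # Strip quality values like ";q=0.8" and normalise case.
    tags = [part.split(";")[0].strip().lower()
            for part in header.split(",") if part.strip()]
    speaks_english = any(t == "en" or t.startswith("en-") for t in tags)
    uses_blocked = any(t in BLOCKED_TAGS or t.split("-")[0] in BLOCKED_TAGS
                       for t in tags)
    return speaks_english and not uses_blocked

assert accept_language_ok("en,de-DE,de")
assert not accept_language_ok("de-DE,de")
```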
My company runs our VPN from our datacenter (although we have our own IP block, which hopefully doesn’t get blocked)
Those with revenue-generating systems should capture TCP SYN traffic for a while, monitor access logs and give it the old college try to correlate bots vs legit users by traffic characteristics. Sometimes generalizations can be derived from the correlation, and some of those generalizations can be permitted or denied. There really isn't a one-size-fits-all solution, but hopefully my example can give ideas in additional directions to go. Git repos are probably the hardest to protect since I presume many of the git libraries and tools are using older protocols and may look a lot like bots. If one could get people to clone/commit with SSH there are additional protections that can be utilized at that layer.
[Edit] Other options lie outside of one's network, such as sending pull requests or feature requests to the maintainers of the git libraries so that their HTTP requests look a lot more like a real browser's and stand out from 99% of the bots. The vast majority of bots use really old libraries.
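As a concrete starting point for the "correlate bots vs legit users" part, a rough sketch that tallies per-IP request counts and flags library-style user agents from a combined-format access log (the log format, threshold and UA patterns are assumptions; adjust to your own setup):

```python
# Sketch: flag clients that hammer the server with library-style user agents.
# Assumes a roughly nginx/Apache "combined" log format; adjust parsing to yours.
import re
from collections import defaultdict

LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \d+ "[^"]*" "([^"]*)"')
LIBRARY_UA = re.compile(r"python-requests|go-http-client|curl|libwww|okhttp", re.I)

def suspects(log_path: str, min_requests: int = 1000):
    counts = defaultdict(int)
    agents = {}
    with open(log_path) as fh:
        for line in fh:
            m = LINE.match(line)
            if not m:
                continue
            ip, ua = m.group(1), m.group(2)
            counts[ip] += 1
            agents[ip] = ua
    for ip, n in sorted(counts.items(), key=lambda kv: -kv[1]):
        if n >= min_requests and LIBRARY_UA.search(agents.get(ip, "")):
            yield ip, n, agents[ip]

if __name__ == "__main__":
    for ip, n, ua in suspects("access.log"):
        print(f"{ip}\t{n}\t{ua}")
```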
I'm also left wondering what other things you could do. For example, I have several friends who built their own programming languages; I wonder what the impact would be if you translated lots of repositories into your own language and hosted them for bots to scrape. Could you introduce sufficient bias in an LLM to make an esoteric programming language popular?
it's called "LLM grooming"
https://thebulletin.org/2025/03/russian-networks-flood-the-i...
> undermining democracy around the globe is arguably Russia’s foremost foreign policy objective.
Right, because Russia is such a cartoonish villain it has no interest in pursuing its own development and good relations with any other country, all it cares about is annoying the democratic countries with propaganda about their own messed up politics.
When did it become acceptable for journalists to make bold, generalizing claims against whole nations without a single piece of direct, falsifiable evidence for what they claim and, worse, to make claims like this that can be easily dismissed as obviously false by quickly looking at the country's policies and its diplomatic interactions with other countries?!
That's actually pretty much spot on.
yes, i know, it's not a linear axis, it's a multi-dimensional perspective thing. so do a PCA/projection and spit out one number, according to your values/beliefs
The fact that you think this is something to do with "both sides" instead of a simple question of facts really gives you away.
Wasn't there a study a while back showing that a small sample of data is good enough to poison an LLM? So I'd say it for sure is possible.
My personal services are only accessible from my own LAN or via a VPN. If I wanted to share it with a few friends I would use something like Tailscale and invite them to my tailnet. If the number of people grows I would put everything behind a login-wall.
This of course doesn't cover services I genuinely might want to be exposed to the public. In that case the fight with the bots is on, assuming I decide I want to bother at all
I do wonder though. Content scrapers that truly value data would stand to benefit from deploying heuristics that value being as efficient as possible in the info-per-query space. Wastefulness of the described type not only loads your servers, but also their whole processing pipeline on their end.
But there is a different class of player that gains more from nuisance maximization: dominant anti-bot/DDoS service providers, especially those with ambitions of becoming the ultimate internet middleman. Their cost for creating this nuisance is near 0, as they have 0 interest in doing anything with the responses. They just want to annoy you until you cave and install their "free" service; then they can turn around and ask interested parties to pay for access to your data.
And what is the effect?
I opened https://iocaine.madhouse-project.org/ and it gave me the generated maze, thinking I'm an AI :)
>If you are an AI scraper, and wish to not receive garbage when visiting my sites, I provide a very easy way to opt out: stop visiting.
If you mean scrapers in terms of the bots, it is because they are basically scraping web content via HTTP(S) generally, without specific optimisations using other protocols at all. Depending on the use case intended for the model being trained, your content might not matter at all, but it is easier just to collect it and let it be useless than to optimise it away[0]. For models where your code in git repos is going to be significant for the end use, the web scraping generally proves to be sufficient so any push to write specific optimisations for bots for git repos would come from academic interest rather than an actual need.
If you mean scrapers in terms of the people using them, they are largely akin to “script kiddies” just running someone else's scraper to populate their model.
If you mean scrapers in terms of the people writing them, then the fact that just web scraping is sufficient, as mentioned above, is likely the significant factor.
> why the scrappers do not do it in a smarter way
A lot of the behaviours seen are easier to reason about if you stop considering scrapers (the people using scraper bots) to be intelligent, respectful, caring people who might give a damn about the network as a whole, or who might care about doing things optimally. Things make more sense if you consider them to be in the same bucket as spammers, who are out for a quick lazy gain for themselves and don't care, or even have the foresight to realise, how much it might inconvenience[1] anyone else.
----
[0] The fact that this load might be inconvenient to you is immaterial to the scraper
[1] The ones that do realise that they might cause an inconvenience usually take the view that it is only a small one, and how can the inconvenience that little old them are imposing really be that significant? They don't take the extra step of considering how many people like them are out there thinking the same. Or they think that if other people are doing it, what is the harm in just one more? Or they just take the view “why should I care if getting what I want inconveniences anyone else?”.
Recognize that a website is a Git repo web interface. Invoke elaborate Git-specific logic. Get the repo link, git clone it, process cloned data, mark for re-indexing, and then keep re-indexing the site itself but only for things that aren't included in the repo itself - like issues and pull request messages.
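A rough sketch of what that could look like, assuming the scraper can recognise a Gitea/Forgejo-style web UI and that the repo page URL plus ".git" is cloneable (both of which are assumptions):

```python
# Sketch: clone the repo once instead of crawling every blob/blame/commit page.
import subprocess
from urllib.parse import urlparse

def looks_like_repo_ui(html: str) -> bool:
    # Crude heuristic; a real scraper would use proper fingerprinting.
    return "Powered by Gitea" in html or 'content="Gitea' in html

def clone_instead_of_crawl(repo_page_url: str, dest: str) -> None:
    # For Gitea-style UIs the clone URL is usually the repo page plus ".git".
    parsed = urlparse(repo_page_url)
    clone_url = f"{parsed.scheme}://{parsed.netloc}{parsed.path.rstrip('/')}.git"
    subprocess.run(["git", "clone", "--mirror", clone_url, dest], check=True)
    # Process the cloned data here; only issues/PR discussions still need
    # the web UI, so those are the only pages left to re-crawl.
```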
The scrapers that are designed with effort usually aren't the ones webmasters end up complaining about. The ones that go for quantity over quality are the worst offenders. AI inference-time data intake with no caching whatsoever is the second worst offender.
That doesn't even sound all that bad if you happen to catch a human. You could even tell them pretty explicitly with a banner that they were browsing the site in no-links mode for AI bots. Put one link to an FAQ page in the banner, since that at least is easily cached.
Failing that I would use Chrome / PhantomJS or similar to browse the page in a real headless browser.
Surely someone would write a scraper to get around this, but it couldn't be a completely-plain https scraper, which in theory should help a lot.
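A sketch of that no-links mode, with a naive regex standing in for a real HTML parser and a made-up bot heuristic, just to show the shape of it:

```python
# Sketch: serve suspected bots the same page with all links flattened to text,
# plus a banner pointing humans at a single (cacheable) FAQ link.
import re

BANNER = (
    '<div class="bot-banner">You appear to be a bot, so links are disabled. '
    'Human? See the <a href="/faq">FAQ</a>.</div>'
)

def strip_links(html: str) -> str:
    # Naive: keep the anchor text, drop the <a> tags. Use a real parser in practice.
    return re.sub(r"<a\b[^>]*>(.*?)</a>", r"\1", html, flags=re.I | re.S)

def render(html: str, user_agent: str) -> str:
    looks_like_bot = "bot" in user_agent.lower()  # placeholder heuristic
    if looks_like_bot:
        return BANNER + strip_links(html)
    return html
```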
They don't even use the Wikipedia dumps. They're extremely stupid.
Actually there's not even any evidence they have anything to do with AI. They could be one of the many organisations trying to shut down the free exchange of knowledge, without collecting anything.
For what it's worth, most requests kept coming in for ~4 days after -everything- returned plain 404 errors. Millions of them. And there are still some now, weeks later...
"This is depressing. Profoundly depressing. i look at the statistics board for my reverse-proxy and i never see less than 96.7% of requests classified as bots at any given moment. The web is filled with crap, bots that pretend to be real people to flood you. All of that because i want to have my little corner of the internet where i put my silly little code for other people to see."
do we want to change that? do we want to require scrapers to pay for network usage, like the ISPs were demanding from Netflix? is net neutrality a bad thing after all?
When scraping was mainly used to build things like search indexes which are ultimately mutually beneficial to both the website owner and the search engine, and the scrapers were not abusive, nobody really had a problem.
But for generative AI training and access, with scrapers that DDoS everything in sight, and which ultimately cause visits to the websites to fall significantly and merely return a mangled copy of its content back to the user, scraping is a bad thing. It also doesn't help that the generative AI companies haven't paid most people for their training data.
But for some reason corporations don't want that, I guess they want to be allowed to just take from the commons and give nothing in return :/
b) They have a complete lack of respect for robots.txt
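For contrast, a sketch of what minimally respecting robots.txt looks like from the crawler side, using only the Python standard library (the URL and user agent are placeholders):

```python
# Sketch: what a minimally polite crawler would do before fetching anything.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.org/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler/1.0", "https://example.org/some/page"):
    pass  # go ahead and fetch
else:
    pass  # skip it; the site asked you not to
```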
I'm starting to think that aggressive scrapers are part of an ongoing business tactic against the decentralized web. Gmail makes self hosted mail servers jump through arduous and poorly documented hoops, and now self hosted services are being DDOSed by hordes of scrapers…
So much real human traffic that it brings their site down?
I mean yes it's a problem, but it's a good problem.
I can't speak for everyone, but the web should be free and scraping should be allowed insofar that it promotes dissemination of knowledge and data in a sustainable way that benefits our society and generations to come. You're doing the thing where you're trying to pervert the original intent behind those beliefs.
I see this as a clear example of the paradox of tolerance.
Also, have you considered Captchas for first contact/rate-limit?
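On the rate-limit half of that, a minimal per-IP token-bucket sketch (the numbers are placeholders; a real setup would do this in the reverse proxy):

```python
# Sketch: allow a burst of 30 requests per IP, refilling at 1 request/second.
import time
from collections import defaultdict

RATE = 1.0    # tokens per second (placeholder)
BURST = 30.0  # bucket size (placeholder)

_buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow(ip: str) -> bool:
    bucket = _buckets[ip]
    now = time.monotonic()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["ts"]) * RATE)
    bucket["ts"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False
```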
If you have smart scrapers, then good luck. I recall that bot farms use pre-paid SIM cards for their data connections so that their traffic comes from a good residential ASN. They also have a lot of IPs and overall well-made headless browsers with JS support. Then it's a battle of JS quirks where the official implementation differs from the headless one.
Is this viable?
no
for many reasons