Shields up
Defending against the LLM scraper flood.
Enough is enough
Scraping the open web for anything that can be fed to the LLMs that are passed off as artificial “intelligence” has become so aggressive that even I have finally come to terms with the need to protect myself and my online presence from it.
I've always had a “moderately tolerant” stance towards these kinds of phenomena. For example, I was very late in adopting an ad blocker, because I felt that there was a sense of “equivalent exchange” in benefiting from free content while tolerating ads that I would have gladly gone without; even as the amount and invasiveness of ads grew, I resisted, until it was finally too much and I deployed uBlock Origin across all my browsers and machines.
Similarly, I've tolerated scrapers as long as they've been well-behaved, even when questionably more expansive and persistent than search engine web crawlers. But in the last few months, things have changed. Scrapers have increased in number, and more and more often they are poorly coded enough to bog down my home server in what cannot be described in any other way than a DoS.
First steps
I had already started setting up some precautionary measures such as the well-known fail2ban intrusion prevention to protect the machine against secure shell exploitation attempts, but that was all. (In fact, in reviewing my fail2ban configuration for what I'm going to discuss, I found out I had been way too tolerant, and have taken the opportunity to tighten that part of the process too, but that's beyond the scope of this article.)
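For reference, a basic sshd jail in fail2ban's jail.local looks something like the following; the numbers here are purely illustrative, not my actual configuration:

    # /etc/fail2ban/jail.local: a basic sshd jail (values purely illustrative)
    [sshd]
    enabled = true
    # ban an IP for one hour after 3 failed attempts within 10 minutes
    maxretry = 3
    findtime = 10m
    bantime = 1h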
From time to time, however, I was seeing some intense traffic against my gitweb instance that was obviously indiscriminate scraping activity, which would bring the load on the machine to ridiculous levels.
The first serious step was setting up a “manual” category in my fail2ban configuration to jail the most egregious offenders; for the curious, I've banned the entire 146.174.0.0/16 and 202.76.0.0/16 subnets, which may be a bit more aggressive than necessary, but a /24 wasn't enough and I honestly couldn't care enough to find the smallest mask; sorry if anybody got caught.
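For completeness, the net effect of jailing whole subnets like these boils down to a couple of firewall rules; something like the following would do it by hand (a sketch, independent of how the fail2ban “manual” jail is actually wired up):

    # drop everything from the offending subnets
    # (a by-hand sketch, not necessarily how the fail2ban "manual" jail applies its bans)
    iptables -I INPUT -s 146.174.0.0/16 -j DROP
    iptables -I INPUT -s 202.76.0.0/16 -j DROP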
Because obviously that's part of the problem: to make it harder to use tools like fail2ban, these scrapers implement ban evasion in a number of ways, ranging from credible user agent identifiers (even though the access patterns are “obviously” non-human, when seen by a human) to —most importantly— spread-out attacks (i.e. more of a DDoS), which forces hosters onto the defensive, playing whack-a-mole on individual IPs while the attackers (scrapers) keep jumping from one to the next.
Enter ansuz
I'm not the only one with this problem, obviously. JWZ of Netscape and XScreenSaver fame, for example, has written extensively (for example, here are his latest musings on the topic) about the honeypot he has set up to poison the scrapers. (I highly recommend reading the comments too, for additional suggestions from other people.)
But arguably, what finally got me to get a move on (aside from a peak in the assaults) were some recent Fediverse posts by @ansuz@social.cryptography.dog that were ultimately collected and expanded into a blog post (with an interesting follow-up).
The reason this caught my attention is that it presented a simple (trivial, even) way to catch (some) scrapers: a “neverlink”, i.e. a link that, by virtue of being commented out or explicitly tagged as “not to be followed” and hidden by style, would be invisible to all but the most aggressive scrapers.
(Update: since the nofollow attribute is intended for ranking rather than crawling, and there is no clear way to indicate that a specific link should not be followed for crawling, I have also added the neverlinks to robots.txt for exclusion by all user agents. We'll see if this helps refine their use for scraper detection.)
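The robots.txt side of things is as simple as it gets; something along these lines, where the paths are placeholders rather than the actual neverlink URLs:

    # robots.txt (the paths are placeholders, not the actual neverlink URLs)
    User-agent: *
    Disallow: /some-neverlink-path/
    Disallow: /another-neverlink-path.html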
Since my most heavily bombarded subdomain was the gitweb, I took the opportunity to update it to the latest and change it so as to add two neverlinks: a commented link in the head tag, and a nofollow, display: none one in the body.
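In practice, the two neverlinks look more or less like this (the URLs are made up for the sake of illustration):

    <head>
      <!-- commented-out neverlink: invisible to browsers, but some scrapers
           will still extract and fetch the URL
      <a href="https://example.org/never/head">nothing to see here</a>
      -->
    </head>
    <body>
      <!-- hidden, nofollow neverlink -->
      <a href="https://example.org/never/body" rel="nofollow"
         style="display: none">do not follow this</a>
    </body>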
Moving forward: the tarpit
I was actually surprised when there were a couple of hits to the (404ing) linked page, so I started working on extending the effectiveness of the “scraper detection”. Rather than just banning any IP that tries to fetch the neverlink (of limited effectiveness, given that the extensive host jumping results in each IP fetching a single URL), I took some of my free time yesterday to create a tarpit, something I had been pondering for months (so yes, arguably, @ansuz's post finally made me do it, and as usual it took months; I'm not Oblomov for nothing).
(Without going too much into details: a honeypot is something that looks palatable, so that attackers are lured in and waste their time getting stuck there; a tarpit is something that is intentionally designed to slow things down.)
The tarpit is a PHP script that serves a standard HTML page, but the entire content is (1) randomly generated and (2) delivered at a slow rate, pausing a little after each character.
I must say the effect I'm using to slow things down is actually fascinating to look at, giving a bit of an old “typewriter” feel. But of course it's not there for the aesthetics: the idea is to keep the bot hooked for several seconds (potentially hours, with millions of characters generated in a single run, if it's ever left to finish, although from what I've seen these bots generally time out after a few seconds) instead of hammering my servers with hundreds of requests per second.
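The core of the idea fits in a handful of lines of PHP; this is a simplified sketch rather than the actual script, and the rate, markup and amount of filler are all just examples:

    <?php
    // Simplified sketch of the tarpit: stream random filler, one character at a
    // time, with a short pause after each character. Not the actual script; all
    // values are examples. (Server-side buffering, e.g. in nginx/php-fpm, may
    // need to be disabled for the per-character flush to reach the client.)
    header('Content-Type: text/html; charset=utf-8');
    set_time_limit(0);                 // run for as long as the client hangs on
    while (ob_get_level() > 0) ob_end_flush();

    function emit(string $text): void {
        foreach (str_split($text) as $ch) {
            echo $ch;
            flush();
            usleep(50000);             // ~50 ms per character: the "typewriter" effect
        }
    }

    emit("<!DOCTYPE html><html><head><title>Welcome</title></head><body>\n<p>");
    for ($i = 0; $i < 1000000; $i++) {         // millions of characters, if ever left to finish
        emit(bin2hex(random_bytes(4)) . ' ');  // random "words" as filler
    }
    emit("</p>\n</body></html>\n");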
What will there be next?
Both the tarpit and the scraper detection are under development. I'm trying out a few different ideas to see what's more effective. The new version that is coming up momentarily includes several improvements already.
First of all, neverlinks now come in different flavours (with or without hostname) for each of the two kinds (head and body).
Secondly, the tarpit itself now includes neverlinks.
And thirdly, since this time the HTML is dynamically generated by PHP, I've added “session IDs” that make each page unique. The intent here is to get the scrapers to keep coming back to the tarpit with multiple requests, an idea that would probably be better served by adding to the tarpit an infinitely generated maze of twisty little passages, all alike, which I'll most likely end up looking into.
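To give an idea of the direction, each generated page could carry something along these lines, combining the session IDs with maze-style self-links; this is a sketch of the concept, and the paths and parameter names are made up:

    <?php
    // Sketch: each generated page carries a fresh "session ID", and maze-style
    // links back into the tarpit make every fetch look like a new, unique page.
    // Paths and parameter names are made up for illustration.
    $session = bin2hex(random_bytes(8));
    for ($room = 0; $room < 4; $room++) {
        printf('<a href="/tarpit.php?s=%s&amp;room=%d">a twisty little passage</a> ',
               $session, $room);
    }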
What's missing
With neverlinks, catching the stupidest scrapers is easy. With the IP jumping, though, an effective tarpit needs to detect them on first connection, and with the randomized user agents this is quite non-trivial, since the “hook” is indistinguishable from a human connection (a human being subsequently trivial to recognize from the fetching of related resources such as CSS and JS). This means that post-processing of the logs is necessary to find these patterns (no first-connection detection), potentially followed by subnet-wide banning or tarpit redirection. I wonder if it would be possible to add some warning text to the tarpit so that if a human ever gets caught in it, they'll see the warning even while the browser hangs fetching the rest of the page. And is that even worth it?
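As for the log post-processing, it doesn't need to be anything fancier than something like this (the log path and URL patterns are placeholders):

    # group requests to the neverlinks/tarpit by /16, busiest first
    # (log path and URL patterns are placeholders)
    grep -E '/never|/tarpit' /var/log/nginx/access.log \
      | awk '{ split($1, ip, "."); print ip[1] "." ip[2] ".0.0/16" }' \
      | sort | uniq -c | sort -rn | head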
(And yes, this is why we can't have nice things.)