Shields up, part 2
More aggressive defense against the LLM scraper flood.
We're going to need bigger shields
As I mentioned last time, my biggest issue with the wave of LLM scrapers isn't even with this particular website of mine, but with my gitweb instance. This is in large part because while the Wok is static (and thus relatively cheap to host even if scraped aggressively), the gitweb instance isn't, and some of its links allow it to serve some pretty hefty binary blobs. And the countermeasure I mentioned in my previous post on the topic wasn't cutting it anymore. It was time to go for something much more strict.
The solution I've decided to go with this time has been to limit most gitweb commands
to only serve the correct data if an appropriate referrer is given,
i.e. if the request originated from a page for which it made sense.
To wit, a request to e.g. download the full archive for one of the repositories I host
should only be triggered by following a link from the corresponding project page:
anything else will be assumed to be spurious, and will instead return a 401 Unauthorized error.
As usual, the presence of this error will then lead to a 7-day ban for that IP. The system has been up for approximately 3 hours at the time of writing, and it has already caught over 9000 offending IPs (nearly 13,000, in fact). I'm guessing the next step would be to collect some subnet information about this huge list of IPs and proceed to ban entire subnets.
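The subnet aggregation could be as simple as grouping the banned addresses by /24 and flagging the crowded ones; the sample addresses and the threshold below are made up for illustration:

```python
# Sketch: group banned IPs by /24 subnet and flag subnets with
# multiple offenders as candidates for a full-subnet ban.
from collections import Counter
from ipaddress import ip_network

banned_ips = ["203.0.113.5", "203.0.113.77", "198.51.100.2"]  # example data

subnet_counts = Counter(
    ip_network(f"{ip}/24", strict=False) for ip in banned_ips
)

# Subnets with at least 2 banned IPs (threshold is arbitrary)
candidates = [net for net, n in subnet_counts.items() if n >= 2]
print(candidates)  # [IPv4Network('203.0.113.0/24')]
```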
This is a bit dangerous
To be honest, this is as successful as it is dangerous. Browsers themselves often don't send the correct referrer information “for privacy”, so even a human who genuinely ends up on my gitweb using an overprotective browser will end up with a 7-day ban. The only thing I can say is: sorry, the bot flood has made your access pattern largely indistinguishable from that of any of these bots, so you'll have to rethink your browser habits.
I wonder if I can count on the presence of the Do Not Track header in this case? Does anybody know if bots send it?
I'm also wondering if the 401 error page I send should also use the same trick as the neverlink tarpit I mentioned in the previous post, and “bleed out” the response very slowly.
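If I did go that route, the drip could be as simple as a generator that yields the error page a few bytes at a time; the chunk size and delay below are arbitrary:

```python
# Sketch of a "slow bleed" error response: yield the body in tiny
# chunks, sleeping between each one to tie up the client.
import time

def bleed(body: bytes, chunk: int = 8, delay: float = 0.5):
    """Drip `body` out `chunk` bytes at a time, pausing `delay` seconds."""
    for i in range(0, len(body), chunk):
        time.sleep(delay)
        yield body[i:i + chunk]

# e.g. a CGI/WSGI layer could stream bleed(b"<html>401 Unauthorized</html>")
```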
I'll have to think about it.