Against the Web Environment Integrity proposal

In April this year (2023), Google (or at least some of its employees) came out with a proposal for a Web Environment Integrity attestation mechanism (in effect, Digital Restrictions Management, DRM, for the web), one of the most outlandish, nightmarish attacks on the open web from a player dominant in both the client and server space.

I've never felt more pressure (self-inflicted) for never having gotten around to writing a follow-up to my (Italian) article on the dangers of monocultures on the web, a follow-up I've been mulling over for years: since, in fact, WebKit became the most-used rendering engine, even before Google forked it into Blink, which was then adopted by large swaths of FLOSS browsers.

Although I've touched on it briefly in my Opera Requiem series, I've never discussed in depth how dangerous these de facto monopolies are even when the monopolistic product is open source. This matters because many of the more sophisticated users and web developers who fought against the Internet Explorer monopoly settled down once the browser war was “won” by Chrome, on the assumption that, since the winning browser was FLOSS, there was no danger of the kind of abuses Microsoft got away with thanks to the dominance of IE6.

We've seen now and again how this is far from true, and I've mentioned my pet peeves several times (e.g. here): RSS, SVG, SMIL, MathML and, more recently, JPEG XL. And the fact that Google has gone ahead with implementing their own proposal in their Android browser, despite the enormous pushback the proposal has received, is the umpteenth nail in the coffin of Google's trustworthiness with respect to the open web.

I've been aggressively using all the web standards sabotaged by Google on this site, putting “infobox” warnings about possible misrendering due to allegedly ‘modern’ browsers missing support for useful web standards (e.g. here), so the time has come to finally tackle the Web Environment Integrity misfeature. To this end, I've followed the example set by @77nn@livellosegreto.it, taking action through a small piece of JavaScript loaded by all pages that checks for the existence of navigator.getEnvironmentIntegrity.

On this site, if the symbol is found, a JavaScript alert is shown and a banner is added to the top of the page, warning about the dangers of using Chrome and recommending a switch to a different browser. I may tune the alert and/or message in the future, but ironically I don't yet seem to have a browser that actually supports this harrowing functionality. (For the curious: this is the JavaScript, and this is the CSS for the warning box.)
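For reference, the check boils down to something like the following sketch. The function name, banner class, and wording below are mine, not necessarily what the actual script on this site uses:

```javascript
// Hedged sketch of the WEI check described above; names and wording are
// illustrative, not necessarily those of the site's actual script.

// Pure predicate, so it can also be exercised outside a browser.
function hasWebEnvironmentIntegrity(nav) {
  return Boolean(nav) && 'getEnvironmentIntegrity' in nav;
}

// Browser-only part: an alert plus a warning banner at the top of the page.
if (typeof navigator !== 'undefined' && typeof document !== 'undefined' &&
    hasWebEnvironmentIntegrity(navigator)) {
  const msg = 'Your browser supports Web Environment Integrity (DRM for the web): ' +
    'consider switching to a browser that respects the open web.';
  const banner = document.createElement('div');
  banner.className = 'wei-warning'; // hypothetical class, styled by the site CSS
  banner.textContent = msg;
  document.body.prepend(banner);
  alert(msg);
}
```

Feature-detecting the property on navigator is enough here: the point is not to call the API, just to find out whether the browser ships it.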

And a side dish for automation

While I was at it, I've finally added a robots.txt control file to keep away well-behaved “machine learning” bots, and I've added the Wok to the Marginalia search engine index. I'm guessing the next step would be to make the site accessible through the Gemini protocol, although for that I would also need either a Markdown to Gemini text transpiler, or to push for web browsers to add support for Gemini while still accepting HTML and friends.

Analytics (2023-08-10 update)

As pointed out by some of my readers, it's a bit hypocritical to rant about Google's dominance on the web while I still have Google Analytics tracking code on my website. This was actually something I wanted to mention already in the first draft of this article, but it got late at night and I forgot, so let's have this minor update.

There isn't much to say: GA is a pretty invasive tracker, and I've been intending to replace it for a while now, particularly since finding out about Plausible. So the plan is to replace GA with a self-hosted Plausible instance. The question (I'm not Oblomov for nothing) is when I'll finally get around to it. As it happens, GA had actually stopped working anyway, since I was still using the old v3 tracker, so I decided to just get rid of it altogether in the meantime. I guess I'll have to wade through my Apache logs for visitor information for the moment.

And still more LLM scrapers (2024-08-29 update)

Last night I noticed that my access logs for the month were exceptionally dense, and I found out that the culprit was a new LLM scraper. For those interested, the User Agent string is

Mozilla/5.0 (compatible) Ai2Bot-Dolma (+https://www.allenai.org/crawler)

The bot does seem to check robots.txt, so I've updated mine, but I don't know whether it actually respects it. Let's see if that's sufficient or if a more aggressive approach is needed (the scraping seems to have lasted from August 17 to August 26, so I'm not sure it'll happen again soon anyway).

I seriously have to start thinking about a lower-level approach: detect "AI" crawlers and use the detection to poison the content they get, instead of just asking them to skip the website.
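The detection half of that idea is easy to sketch; the hard part is deciding what to serve matched requests (poisoned text, a 403, a tarpit), which would hang off this check in the server or a CGI/middleware layer. The pattern list below just mirrors my robots.txt entries:

```javascript
// Hedged sketch: match a request's User-Agent against known "AI" crawler
// signatures. The pattern list mirrors my robots.txt entries; what to do
// with a match (poisoned content, 403, tarpit) is left to the server side.
const AI_UA_PATTERNS = [
  /GPTBot/i, /CCBot/i, /ChatGPT-User/i, /Google-Extended/i,
  /Omgilibot/i, /FacebookBot/i, /anthropic-ai/i, /Ai2Bot/i,
];

function isAiCrawler(userAgent) {
  return AI_UA_PATTERNS.some((re) => re.test(userAgent || ''));
}
```

Substring matching on the user agent is of course only a first line of defense: a scraper that lies about its identity sails right through, which is why a lower-level approach (request-rate heuristics, for instance) may eventually be needed.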

My robots.txt

As of now, for the curious, these are the contents of my robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Ai2Bot
Disallow: /