Multilingual Ikiwiki
Some notes on making the Wok properly multilingual
Introduction
The Wok is multilingual. Until recently, I didn't particularly bother about specifying the language of any article I wrote (curiously, with the exception of the one about writing SVG by hand, which is one of the first —but not the first!— of my English-language articles).
At first, this was because most of the content was in Italian, but as the English content grew, I started pondering the issue.
Curiously, IkiWiki does have some support for multi-lingual content,
through its po plugin, but it's geared towards the original scope of the platform
(wikis), as a means to provide an interface to content available in multiple languages
(i.e., basically, translated versions of the same page or article).
This is not what I need, which is instead a way to mark the language of each page independently.
Interestingly, I'm not the only one with such a need,
but the feature is still not readily available in IkiWiki.
And still, it may be considered a rather important feature since,
per the WCAG,
pages are required
to have a programmatically-discoverable language, and adding the relevant attribute (lang)
is the way to do it.
So, today, I finally took the plunge and started to work on adding language declaration support in IkiWiki. This is currently implemented in the fork of IkiWiki I run for the Wok. I may even propose it for upstreaming one of these days.
The interface
Declaration of the article language is integrated in the meta plugin,
consistently with the author, date, title and most user-managed article metadata:
trivially and unsurprisingly, you specify the language with [[!meta lang="lang-code"]]
where the lang-code is the language code (see MDN for details).
(My implementation currently doesn't validate the language code, although it does restrict it to upper- and lowercase basic Latin letters, numbers, and the dash sign.)
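To make the restriction concrete, here is a minimal Python sketch of that kind of character check. It is not the code in my IkiWiki fork (which is Perl); the names `LANG_CODE_RE` and `is_acceptable_lang_code` are made up for illustration, and the exact pattern is an assumption based on the description above.

```python
import re

# Assumed pattern: one or more basic Latin letters, digits, or dashes.
# This only restricts the character set; it does not check that the
# code is a real, registered language tag.
LANG_CODE_RE = re.compile(r"[A-Za-z0-9-]+")


def is_acceptable_lang_code(code: str) -> bool:
    """Return True if code contains only the permitted characters."""
    return LANG_CODE_RE.fullmatch(code) is not None
```

With this check, `it` and `en-US` pass, while an empty string or anything containing quotes, spaces or underscores is rejected. A nonsensical but well-formed value such as `zz-999` still passes, which is exactly the "no validation" caveat.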
The language code is exposed to the IkiWiki template system as PAGELANG,
but the default template hasn't been changed to accommodate it (yet).
On my custom templates, I use it to add the appropriate lang attribute
to the article tag for both main and inline pages.
And this seems to be sufficient.
Tagging the Wok
The next step was obviously adding the appropriate language tag to all existing Wok articles. Since these now number in the hundreds, I had no intention whatsoever of tagging them manually myself, so I've relied on a couple of scripts.
The first is a trivial Python wrapper around the langdetect Python library,
appropriately called langdetect.
There's definitely room for improvement, but the following was sufficient for me:
#!/usr/bin/python3
import sys
import langdetect

# Fix the seed so that detection results are reproducible across runs
langdetect.DetectorFactory.seed = 0

if len(sys.argv) > 1:
    # One output line per file: name, tab, best-guess language code
    for file in sys.argv[1:]:
        with open(file, 'r') as f:
            print('{}\t{}'.format(file, langdetect.detect(f.read())))
else:
    # No arguments: echo the piped text, then list all candidates
    sample = sys.stdin.read()
    print(sample)
    print([ (l.prob, l.lang) for l in langdetect.detect_langs(sample) ])
This prints the file name and the best guess at its language for each file passed on the command line. When called without arguments, it instead prints the text read from standard input, followed by the candidate languages and their probabilities. Of course, for our purposes we only care about the first mode.
We need another script to go over all of the documents in the Wok source, and add the tag for the guessed language. I rolled my own as:
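#!/bin/sh
# Add a meta lang directive to every page that has an author but no language yet
find . -name '*.mdwn' | while IFS= read -r fname ; do
    if grep -q -F '[[!meta lang=' "$fname" ; then
        printf '%s SKIPPED (lang already set)\n' "$fname"
    else
        if grep -q -F '[[!meta author=' "$fname" ; then
            # langdetect prints "name<TAB>lang": keep only the last field
            lang="$(langdetect "$fname" | awk '{print $NF;}')"
            printf 'setting %s to lang %s\n' "$fname" "$lang"
            # GNU sed: append the directive after the author line, keeping a .nolang backup
            sed -i.nolang -e '/\[\[!meta author=/a\[[!meta lang="'"$lang"'"]]' "$fname"
        else
            printf '%s SKIPPED (no author)\n' "$fname"
        fi
    fi
done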
This isn't the fastest, but it does its job.
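For reference, the core of the tagging step (skip pages that already declare a language, skip pages without an author, otherwise add the directive right after the author line) can also be sketched in Python. This is an illustrative rewrite, not what I actually ran: the function name `add_lang_directive` is hypothetical, and unlike the `sed` command it only acts on the first author line it finds.

```python
def add_lang_directive(text: str, lang: str) -> str:
    """Insert a meta lang directive after the first meta author line.

    The text is returned unchanged if it already has a lang directive
    or if there is no author directive to anchor on.
    """
    if '[[!meta lang=' in text:
        return text
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if '[[!meta author=' in line:
            # Make sure the author line is terminated before appending
            if not line.endswith('\n'):
                lines[i] = line + '\n'
            lines.insert(i + 1, '[[!meta lang="{}"]]\n'.format(lang))
            return ''.join(lines)
    return text
```

Working on the text rather than on the file makes it easy to try the logic on a few pages before letting it loose on the whole source tree.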
The downside? Unless your articles set their publication date manually, this will mark all of them as updated, in random order. Which is going to wreak havoc on some inline page ordering.
In my case, I've gone through all the pages that don't set their date, and updated it manually. Not my happiest moment.
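If you want to know in advance which pages would be affected, something along these lines can list them. It is only a sketch: the function name `pages_without_date` is made up, and it assumes the date is always written literally as `[[!meta date=`, so a directive with unusual spacing would be reported as missing.

```python
import os


def pages_without_date(root: str = "."):
    """Yield paths of .mdwn files under root with no meta date directive."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".mdwn"):
                continue
            path = os.path.join(dirpath, name)
            # Undecodable bytes are replaced: we only search for ASCII text
            with open(path, encoding="utf-8", errors="replace") as f:
                if "[[!meta date=" not in f.read():
                    yield path
```

Running `for path in pages_without_date("."): print(path)` from the top of the wiki source before tagging gives the list of pages whose date should be set explicitly first.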