Sparkling wok, episode 2

Can the Wok sparklines be improved?


As I mentioned when I first introduced them, the sparklines on the index pages of the Wok are … satisfactory. I like them better now that I've enhanced them with metadata that becomes visible on hover, but I'm still not entirely satisfied with them, and from time to time I consider revisiting the idea.

Rather than the presentation, though, what I'm now rethinking is “what should the sparklines represent?” As I mentioned, for the time being I've opted to use git commits as a proxy for activity on the Wok in general. This works reasonably well for the top-level index, but it becomes a weaker proxy in the individual categories. There I may not be interested in counting minor fixes (typos, tag case adjustments, and the like): their commits at times span multiple categories simply because I've opted to apply a similar change across the whole Wok, even though no category-specific content was added or modified.

(And of course, that's without considering the commits where I update the sparklines themselves, which luckily don't affect the category indices at all, only the top-level one.)

An alternative approach would be to build the sparkline from the date and updated metadata of each post that has these fields. This would give sparser sparklines, possibly even too sparse, as it would miss intermediate commits of drafts that I've worked on over several days, something which I often do for longer content (there are articles and other works that have been sitting around as drafts for years now). On the other hand, for readers it would make more sense, as it would reflect when new content deemed significant was added to the Wok.

On the plus side, the data itself is trivial to get, and doesn't even need git. It would be something like this:

grep -h -r -E 'meta (date|updated)=' * |
  cut -d'"' -f2 |
  cut -f1,2 -d- |
  awk '{ c[$1] += 1 } END { for (v in c) print v, c[v]; }' |
  sort -n
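As a quick sanity check, here is a sketch that runs the same pipeline on two made-up pages. The scratch directory, the file names and the dates are invented for illustration; only the `meta` directives follow the usual Ikiwiki syntax.

```shell
# Made-up pages in a scratch directory (names and dates are invented).
wok=$(mktemp -d)
cat > "$wok/a.mdwn" <<'EOF'
[[!meta date="2021-03-04 10:00"]]
[[!meta updated="2021-05-01 09:30"]]
EOF
cat > "$wok/b.mdwn" <<'EOF'
[[!meta date="2021-03-20 18:15"]]
EOF

# Same pipeline as above, pointed at the scratch directory instead of *.
monthly=$(grep -h -r -E 'meta (date|updated)=' "$wok" |
    cut -d'"' -f2 |
    cut -f1,2 -d- |
    awk '{ c[$1] += 1 } END { for (v in c) print v, c[v] }' |
    sort -n)
printf '%s\n' "$monthly"
rm -r "$wok"
```

With these three directives the output should be `2021-03 2` followed by `2021-05 1`: one line per month that has at least one date, and nothing for the empty April in between.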

Of course this still needs to be converted to the HTML-interactive sparkline, which we can do by lifting the logic I had implemented in my git chart. And since we're going through awk anyway, we might as well do it all there. This requires some care, because we still want to process months with zero data, which aren't included in our array. This means that within awk itself we must process dates in sorted order, filling gaps, and we need the maximum value to scale the counts.

Both of these can be achieved by sorting the c array, values and indices, in two different steps. We don't want to destroy the original array, and we only sort by values to get the maximum, so we can recycle the “sorted” array, with something like:

len = asort(c, dates);
max = dates[len];
asorti(c, dates);

which now gives a sorted array dates that we can traverse to get the commit dates (in year-month format) from the oldest to the most recent. Iterating over this array gives, for each date, the number of commits and the scaled size:

# a plain for (i in dates) would visit the indices in unspecified order
for (i = 1; i <= len; ++i) {
    date = dates[i];
    count = c[date];
    scaled = int((8*count + max - 1)/max);
    # TODO output date, count and scaled here
}

with the caveat that dates with no commits (that would give a null count and scaled) are not represented.
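The scaled expression is an integer ceiling of 8*count/max, so any month with at least one entry gets a visible block (1 at the lowest), the busiest month gets 8, and an empty month stays at 0. A small sketch with a made-up maximum of 20 shows the mapping:

```shell
# max = 20 and the list of counts are invented for illustration.
table=$(awk 'BEGIN {
    max = 20
    n = split("0 1 2 3 10 19 20", counts, " ")
    for (i = 1; i <= n; ++i) {
        count = counts[i]
        # integer ceiling of 8*count/max
        scaled = int((8*count + max - 1)/max)
        printf "%d -> %d\n", count, scaled
    }
}')
printf '%s\n' "$table"
```

This prints `0 -> 0`, `1 -> 1`, `2 -> 1`, `3 -> 2`, `10 -> 4`, `19 -> 8` and `20 -> 8`, one per line.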

Since we do want to fill the holes, instead of iterating over the dates array, we can use a slightly different logic: we fetch the year and month of the start of the series, and the year and month of the end of series. Then we simply step through each month, switching to the next year when necessary. This also integrates well with the logic we will need to open and close the year blocks in the output HTML, which we assume is managed by some beginyear() and endyear() functions.

Getting the first year and month in numeric form can be done with something like this:

split(dates[1], ym, "-");
firstyear = 0 + ym[1];
firstmonth = 0 + ym[2];

and similarly for the last. The logic is then something like the following:

year = firstyear;
month = firstmonth;
beginyear(year);
while (1) {
    date = sprintf("%4d-%02d", year, month);
    count = c[date];
    scaled = int((8*count + max - 1)/max);
    output_block(year, month, date, count, scaled);
    if (year == lastyear && month == lastmonth) {
        break;
    }

    ++month;
    if (month == 13) {
        endyear(zwsp);
        month = 1;
        ++year;
        beginyear(year);
    }
}
endyear();

where zwsp is a constant holding the zero-width space we use to allow wrapping between years, and output_block() is the function that prints out the Unicode block element for the given value, with any HTML metadata attached.
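To see the gap filling in action without any HTML, here is a minimal sketch of the whole walk. It is not the actual script: beginyear(), endyear() and output_block() are hypothetical plain-text stand-ins, the shell function name sparkline_sketch is made up, and the first date, last date and maximum are tracked in a single pass rather than with asort() and asorti() (which are gawk extensions), so that the sketch runs on any POSIX awk.

```shell
sparkline_sketch() {
    awk '
    # Hypothetical stand-ins for the real HTML-producing functions.
    function beginyear(year) { printf "%d:", year }
    function endyear()       { printf "\n" }
    function output_block(year, month, date, count, scaled) {
        printf " %s", scaled
    }

    # Capture: count lines per YYYY-MM, tracking the range and the maximum.
    {
        date = substr($1, 1, 7)
        c[date] += 1
        if (c[date] > max) max = c[date]
        if (first == "" || date < first) first = date
        if (last == "" || date > last) last = date
    }

    END {
        if (first == "") exit    # no input, nothing to draw
        split(first, ym, "-"); year = 0 + ym[1]; month = 0 + ym[2]
        split(last, ym, "-"); lastyear = 0 + ym[1]; lastmonth = 0 + ym[2]
        beginyear(year)
        while (1) {
            date = sprintf("%4d-%02d", year, month)
            count = c[date] + 0    # months without entries count as 0
            scaled = int((8*count + max - 1)/max)
            output_block(year, month, date, count, scaled)
            if (year == lastyear && month == lastmonth) break
            if (++month == 13) {
                endyear()
                month = 1
                ++year
                beginyear(year)
            }
        }
        endyear()
    }'
}

printf '%s\n' 2022-11-03 2022-11-20 2023-02-14 | sparkline_sketch
```

For these three made-up dates the output is `2022: 8 0` and `2023: 0 4`: December 2022 and January 2023 show up as zeros even though they never appear in the input.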

To make the awk script a bit more generic, we can make it a little more aggressive in the “capturing” phase. Instead of a simple { c[$1] += 1} which expects input in the form

YYYY-MM optional junk that will be removed

we can make it look for anything that looks like YYYY-MM with something like

if (match($0, "[0-9]{4}-[0-9]{2}")) {
    date = substr($0, RSTART, RLENGTH);
    c[date] += 1
}
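One portability caveat: as far as I know, interval expressions such as {4} are not handled by every awk (older gawk releases needed --re-interval, and some mawk builds lack them). A sketch of the same capture with the digit classes spelled out, wrapped in a made-up shell function count_months and fed a few invented lines:

```shell
count_months() {
    awk '
    # Spelled-out digit classes instead of {4} and {2}.
    match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]/) {
        c[substr($0, RSTART, RLENGTH)] += 1
    }
    END { for (d in c) print d, c[d] }
    ' | sort
}

# Invented input: a meta directive, a bare date, junk, and a prefixed date.
printf '%s\n' \
    '[[!meta date="2021-03-04 10:00"]]' \
    '2021-03-20' \
    'no date on this line' \
    'Author date: 2021-05-01' | count_months
```

This prints `2021-03 2` and `2021-05 1`; the line without a date is skipped, and only the first YYYY-MM match on each line is counted.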

This is possibly a bit too aggressive, but it allows us to pipe anything that outputs a date per line to the script, and get the HTML sparkline for the counts of lines grouped by year and month. You can find the complete script here, and I can use it as

git log --pretty=format:%as | ./sparkline.awk

to get the commit-based sparkline, and as

grep -h -r -E 'meta (date|updated)=' * | ./sparkline.awk

to get the date-based one.

It's fascinating to see the difference between the two. At the moment, the commit timeline looks like this:

▃▁▂▂▄▇▃▁▂ ▄▂▁▁▁▂▄▃▂▂▇█▂▅▄▆▃▂▁▅▃▂▁▁▁▂▂▂▁▁ ▂▂▁ ▂▃▁▂▁ ▁ ▁▁ ▁▂▂▁▁ ▃▁ ▁▁▁▁▂ ▁▁▁▁ ▁ ▁▁▁▁ ▁ ▁▁ ▁ ▁ ▁▁▁▁▁▂▂▁▁▁▁▁▄▁▁▁▁▂▂▁▂▁ ▁▁▁▁ ▂ ▁ ▁▁ ▁▁▁ ▁ ▁▂▂▂▂▂▁▁▁▄▁▁▂▂▁▁▁▁▂▁▂▅▆

while the dates sparkline looks like this:

▁ ▁▁ ▁▁ ▁▁▁▁ ▁▁ ▁▁ ▁ ▁▁ ▁▁▁ ▁▂▁▂▁ ▁ ▁ ▁ ▁ ▂▁▁▂▃▆▄▁▂ ▂▂▁▂▁▃▅▃▂▁▇█▃▂▃▅▂▁▁▃▃▁▁▁ ▁▁▁ ▂ ▁▃ ▁▁ ▁ ▁▁ ▁ ▄ ▁ ▁ ▁ ▁ ▁ ▁▂ ▁ ▂▂▄▁▁▂▁▂▃▁▂▁ ▁▁▁▂ ▁ ▁ ▁ ▁▁▂ ▁ ▁▂▂▂▂▂▂▁▁▃▂▂▂▃▁▁▁▂▂▁▃▃▇

A significant contribution to the difference is that some of the articles in the Wok are much older than the Wok itself, since they were “revived” from my older blog(s) hosted on now-defunct platforms. For the most part though, in the common years the two sparklines are quite similar, except for a few points where there's a distinct difference between the number of commits and the number of posts, highlighting times where there was significant “background” activity (revisions, stylistic changes, and the like) that didn't affect content in a meaningful way.

It should be noted that neither sparkline is particularly precise in indicating my activity, since they both skip days where I work on the Wok (or its content) but neither commit the work nor publish a new or updated article (for example, this one took two days to write, but at the moment of publishing it will contribute only one commit and one date).

I'm still uncertain about which sparkline to keep as “main”, and I'm actually wondering if I should keep both. However, I suspect that may be too heavy; maybe a separate dedicated page for the stat-curious (and myself)?

One thing's for sure, I now have the material to regenerate the sparklines at build time, which should allow me in the near future to remove the ones committed to the repository.

OK, that was fast. I have now replaced the “committed” sparklines with autogenerated ones. The build scripts (both the local one on my machine and the one on the server) have been updated to call the sparkline update script, which now generates both the commit and date sparklines, although the date one is hidden by default.

Ikiwiki has a “transient” page feature for autogenerated pages, but in my case, at least at the moment, these sparklines are not pages on their own, but rather snippets to be inserted in other pages (currently just the index pages, but in the future possibly also in the promised stat page(s)). For this reason, I'm currently abusing the template system instead, which also ensures that the inclusion of the sparklines does not generate additional markup.

An option to show the date sparklines rather than the commit ones, and the stats page, remain as future work.