By max, Wed 22 May 2013, in category Research-notes
Have you ever wondered what the word FOR every language was IN every language?
Data mining Wikidata could give us the answer. Using a new file released by Denny Vrandecic, which contains the Wikidata item for each Wikidata language, I was able to create a full matrix of all the combinations.
Let L be the set of all Wikipedia languages. For every language X and language Y in L, does X have a label for Y? Create a matrix of all the possibilities: if X has a label for Y, let's colour that cell of the matrix magenta; if not, let's colour it cyan. The result is a heatmap displaying whether the language on the X axis has a label for the language on the Y axis.
A heatmap displaying whether the language on the X axis has a label for the language on the Y axis.
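For the curious, here is a minimal sketch of how such a matrix could be built. It is not the source code behind the heatmap; the `labels` dictionary, its three made-up entries, and the function names are all assumptions for illustration, standing in for whatever the real Wikidata file contains.

```python
# Hypothetical input: language code -> set of language codes it has a label for.
# These three entries are invented illustration data, not real Wikidata content.
labels = {
    "en": {"en", "de", "fr"},
    "de": {"en", "de"},
    "fr": {"fr"},
}


def build_label_matrix(labels):
    """Return (languages, matrix).

    matrix[row][col] is True when the column language (X axis)
    has a label for the row language (Y axis).
    """
    languages = sorted(labels)
    matrix = [
        [y in labels[x] for x in languages]  # one cell per X-axis language
        for y in languages                   # one row per Y-axis language
    ]
    return languages, matrix


def colour(cell):
    """Map a matrix cell to the colour scheme described above."""
    return "magenta" if cell else "cyan"


languages, matrix = build_label_matrix(labels)
for y, row in zip(languages, matrix):
    print(y, [colour(cell) for cell in row])
```

Keeping the cells as plain booleans and applying the colours only at display time means the same matrix can be written out as CSV or summed up later without any conversion.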
What interesting things do we find here? Well, most notable is the prominent diagonal magenta line. That's reassuring. The main diagonal of this matrix represents whether each language has a Wikidata label for itself - and it almost always does. The vertical lines show us which languages have good coverage of other languages. And the horizontal lines show us which languages have good coverage by other languages.
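Those three readings (diagonal, vertical lines, horizontal lines) can also be pulled out as numbers. The sketch below assumes a boolean matrix where rows are the Y-axis languages and columns are the X-axis languages; the toy matrix and the `summarise` helper are hypothetical, not taken from the real data or source code.

```python
# Toy data (invented): rows = Y-axis languages, columns = X-axis languages.
# matrix[y][x] is True when language x has a label for language y.
languages = ["de", "en", "fr"]
matrix = [
    [True, True, False],
    [True, True, False],
    [False, True, True],
]


def summarise(languages, matrix):
    """Return the diagonal, column totals and row totals of the matrix."""
    n = len(languages)
    # Main diagonal: languages that have a label for themselves.
    self_labelled = [languages[i] for i in range(n) if matrix[i][i]]
    # Column totals (vertical lines): how many languages X has labels for.
    coverage_of_others = {
        languages[x]: sum(matrix[y][x] for y in range(n)) for x in range(n)
    }
    # Row totals (horizontal lines): how many languages have a label for Y.
    coverage_by_others = {languages[y]: sum(matrix[y]) for y in range(n)}
    return self_labelled, coverage_of_others, coverage_by_others


self_labelled, of_others, by_others = summarise(languages, matrix)
print(self_labelled)
print(of_others)
print(by_others)
```

Sorting either dictionary by value would rank the languages by how strong their vertical or horizontal line is.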
This should serve as a reminder that Wikidata is going to be an outrageously powerful research tool.
Can you see any other patterns emerging? (The UTF-8 CSV Matrix, and source code.)
Notconfusingly yours.