Category Archives: Research Notes

Häskell und Grepl: Data Hacking Wikimedia Projects Exampled With The Open Access Signalling Project

In what could easily become a recurring annual trip, Matt Senate and I came to Berlin this week to participate in the Open Knowledge Festival. We spoke at csv,conf, a fringe event in its first year, ostensibly about comma-separated values but more so about unusual data hacking. On behalf of the WikiProject Open Access – Signalling OA-ness team, we generalized our experience in data-munging with Wikimedia projects for the new user. We were asked to make the talk more story-oriented than technical, and since we were in Germany, we decided to use that famous narrative of Häskell and Grepl.… Read the rest

Wiki-Class Set-up Guide and Exploration

Best viewed with IPython Notebook Viewer


Wiki-Class is a Python package that can determine the quality of a Wikipedia page using machine learning. It is the open-sourcing of the Random Forest algorithm used by SuggestBot. SuggestBot is an opt-in recommender for Wikipedia editors, offering pages that need work and look like pages they’ve worked on before. Similarly, with this package you get a function that accepts a string of wikitext and returns a Wikipedia class (‘Stub’, ‘C-Class’, ‘Featured Article’, etc.). Wiki-Class is currently in alpha according to its packager and developer @halfak, and although I had to make a few patches to get some examples to work, it’s ready to start classifying your wikitext.
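To make the idea concrete, here is a toy stand-in for the kind of function described above: one that maps a string of wikitext to a quality class. This is not the Wiki-Class API (its actual calls and model loading differ; consult the package's documentation) – just an illustrative sketch using a crude length heuristic in place of the trained Random Forest.

```python
# Toy stand-in for a wikitext quality classifier. The real Wiki-Class
# package uses a trained Random Forest model; this sketch substitutes
# a crude word-count heuristic purely for illustration.

CLASSES = ["Stub", "Start", "C-Class", "B-Class", "GA", "FA"]

def toy_classify(wikitext):
    """Return a (fake) Wikipedia quality class for a string of wikitext."""
    words = len(wikitext.split())
    if words < 150:
        return "Stub"       # very short pages are almost always stubs
    if words < 1500:
        return "C-Class"    # mid-length pages: middling quality
    return "FA"             # long pages stand in for Featured Articles

print(toy_classify("'''Foo''' is a small town."))  # -> Stub
```

The interface mirrors the package's promise: string in, class label out; only the decision logic here is a placeholder.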

Read the rest

Method of Reflections: Explained and Exampled in Python

The introduction of the post is mirrored here, but the full tutorial is on IPython Notebook Viewer.



See how the Method of Reflections evolves as a recursive process.

The Method of Reflections (MOR) is an algorithm that first came out of macroeconomics and ranks nodes in a bipartite network. This notebook should help you implement the Method of Reflections in Python. To be precise, it implements the modified algorithm proposed by Caldarelli et al., which solves some problems with the original Hidalgo–Hausmann (HH) algorithm (doi:10.1073/pnas.0900943106). The main problem with HH is that all values converge to a single fixed point after sufficiently many iterations.
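For readers who want the gist before opening the notebook, here is a minimal sketch of the original HH Method of Reflections on a bipartite incidence matrix (rows and columns are the two node sets, e.g. countries and products). Iteration 0 is just the degrees (diversification and ubiquity); each further iteration averages the other side's previous scores over each node's neighbours. This shows the plain HH recursion only, not the Caldarelli et al. modification the notebook uses.

```python
# Original Hidalgo-Hausmann Method of Reflections on a bipartite 0/1
# matrix M (list of rows). Iteration 0 gives node degrees; iteration N
# averages the opposite side's iteration-(N-1) scores over neighbours.

def method_of_reflections(M, n_iter=10):
    """Return (row_scores, col_scores) after n_iter reflections."""
    n_r, n_c = len(M), len(M[0])
    d_row = [sum(row) for row in M]                               # k_{c,0}
    d_col = [sum(M[i][j] for i in range(n_r)) for j in range(n_c)]  # k_{p,0}
    k_row, k_col = d_row[:], d_col[:]
    for _ in range(n_iter):
        # Both updates use the previous iteration's values.
        new_row = [sum(M[i][j] * k_col[j] for j in range(n_c)) / d_row[i]
                   for i in range(n_r)]
        new_col = [sum(M[i][j] * k_row[i] for i in range(n_r)) / d_col[j]
                   for j in range(n_c)]
        k_row, k_col = new_row, new_col
    return k_row, k_col

M = [[1, 1, 1],
     [1, 0, 0]]
print(method_of_reflections(M, 0))  # degrees: ([3, 1], [2, 1, 1])
```

Running it with a large `n_iter` illustrates the convergence problem the post mentions: the scores within each side flatten toward a single fixed point.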

Read the rest

The Virtuous Circle of Wikipedia: The Poster

It may seem like a small piece of work, but I wanted to commemorate this moment – my first poster. I had never before needed to manufacture one. Today I presented it at NetSci (Network Science) 2014 and received many useful comments on the research. We found a few others who are, like ourselves, translating the ‘method of reflections’ into new domains. The paper related to this poster is under review, but you can also access a preprint on GitHub.

On the art side I’d like to thank unluckylion, for encouraging me to make a bold statement. I think it paid off, and I’m only mildly guilty about the blatant copyvio of the Wikipedia logo.… Read the rest

The Listiness of Wikipedia

View this report with the IPython Notebook Viewer (where it looks best).


Although it was only an aside, an answer to the question "What is a reference work?" caught my attention at Michael Buckland's March 21st Friday Afternoon Seminar at the UC Berkeley iSchool. One possible answer suggested was: works that are over 80% list.
Bates' classification of reference works' search patterns.
That definition, although it seems a bit terse, was actually a serious suggestion published by Marcia Bates in 1986. [Bates, Marcia J. "What Is a Reference Book: A Theoretical and Empirical Analysis." RQ 26 (Fall 1986): 37-57] In my opinion this is an elegant way to define reference works because, although heuristic, it's entirely quantitative.
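The heuristic is easy to operationalize. Below is a sketch of one way to apply the ">80% list" test to wikitext: count the fraction of non-empty lines that are list items (wikitext lists begin with `*` or `#`). The line-based parsing and the exact threshold handling are my assumptions, not the post's or Bates' precise method.

```python
# Sketch of the ">80% list" reference-work heuristic for wikitext.
# A line counts as "list" if it begins with '*' or '#' (wikitext list
# markup). Parsing details here are illustrative assumptions.

def listiness(wikitext):
    """Fraction of non-empty lines that are wikitext list items."""
    lines = [l for l in wikitext.splitlines() if l.strip()]
    if not lines:
        return 0.0
    list_lines = [l for l in lines if l.lstrip().startswith(("*", "#"))]
    return len(list_lines) / len(lines)

def is_reference_work(wikitext, threshold=0.8):
    """Bates-style test: is the page more than `threshold` list?"""
    return listiness(wikitext) > threshold

page = "Intro sentence.\n* item one\n* item two\n* item three\n* item four\n* item five"
print(listiness(page))  # 5 of 6 non-empty lines are list items
```

A fuller implementation would also count wikitext tables and embedded templates, which this sketch ignores.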
Read the rest

The Topmost Cited DOIs on Wikipedia

You’re surfing a topic of great interest to you on Wikipedia, so interesting that you actually click through to the references. You’re excited to read the original material, but all of a sudden you are foiled—you’ve hit a paywall! And $35 to read an article is just too steep.

This image of Xanthichthys ringens is sourced from an open-access scholarly article licensed for re-use. How can we make that reusability explicit when citing this source in Wikipedia articles? For further details, see this Signpost op-ed by Daniel Mietchen.

The Wikipedia Open Access Signalling Project, which I’ve recently joined, sees this as a fantastic opportunity to spread the word about the Open Access (OA) movement.… Read the rest

Kumusha Takes Wiki: Actionable Metrics for Uganda and Côte d’Ivoire

Live version available on GitHub and IPython nbviewer.

Read the rest

code4lib – VIAFbot and the Integration of Library Data on Wikipedia


In issue 22 of the code{4}lib journal, a publication focused on libraries, technology, and the future, I published an article with Alex Kyrios.

(A Japanese review of the paper, by the National Diet Library, is also available.)

VIAFbot and the Integration of Library Data on Wikipedia

This article presents a case study of a project, led by Wikipedians in Residence at OCLC and the British Library, to integrate authority data from the Virtual International Authority File (VIAF) with biographical Wikipedia articles. This linking of data represents an opportunity for libraries to present their traditionally siloed data, such as catalog and authority records, in more openly accessible web platforms.

Read the rest


You might be perusing the latest issue of Refer journal and come across my latest article, Wikipedia in the Library. Andrew Gray of the British Library and I focus on the need and opportunity of bringing library data into Wikipedia. From the introduction:

Wikipedia has traditionally been a divisive topic among librarians and academics. Its goal is undeniably positive and almost utopian – access to all of human knowledge, in every language, offered freely to the world. In practice, however, it can typify “the problem of the internet” – a morass of disorganised information, of dubious accuracy and reliability, offered up without authority or control.

Read the rest

The Most Unique Wikipedias According To Wikidata

If you read Wikipedia in more than one language, you’ll have noticed that the sidebar sometimes stands ready to link you to the topic of the current article in one or more other languages. If you’ve been following the trends, you’ll know that Wikidata is now in charge of keeping these language links in order. (To understand more about how Wikidata works, watch my YouTube tutorial starting at 5:15.) One upshot of that is that we can easily count these links and understand more about the Wikipedia projects – like how “unique” different Wikipedias are. I define a unique Wikidata item of a language X to be a Wikidata item that has only one language link, and that language link is in language X.… Read the rest
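That definition translates directly into a counting exercise over items' sitelinks. Here is a sketch with toy data standing in for a Wikidata dump: each item is represented as its list of language links, and an item counts toward language X only when X is its sole link. The data structure is my simplification, not the shape of actual Wikidata records.

```python
# Counting "unique" Wikidata items per language, per the definition
# above: an item is unique to language X if its only language link is
# to the X Wikipedia. The items list is toy data, not real dump records.

from collections import Counter

def count_unique(items):
    """items: iterable of per-item lists of language codes.
    Returns a Counter of unique-item counts keyed by language."""
    return Counter(links[0] for links in items if len(links) == 1)

items = [
    ["en"],               # unique to English
    ["en", "de"],         # linked in two languages: not unique
    ["ja"],               # unique to Japanese
    ["ja"],               # unique to Japanese
    ["fr", "en", "de"],   # not unique
]
print(count_unique(items))  # Counter({'ja': 2, 'en': 1})
```

Dividing each language's unique count by its total article count would then give a per-Wikipedia "uniqueness" rate.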