Häskell und Grepl: Data Hacking Wikimedia Projects Exampled With The Open Access Signalling Project

In what could easily be a recurring annual trip,Matt Senate, and I came to Berlin this week to participate in Open Knowledge Festival. We spoke at the csv,conf a fringe event in its first year, ostensibly about the comma separated values, but more so about unusual data hacking. On behalf of WikiProject Open Access – Signalling OA-ness team, we generalized our experience in data-munging with Wikimedia projects for the new user. We were asked to make the talk more story-oriented than technical; and since we were in Germany, we decided to use that famous narrative of Häskell and Grepl. In broad strokes we go through: how Wikimedia projects work, history of Wiki Data-Hacking, from “Ignore All Rules” to calcification, Wikidata told as Hänsel and Gretel, signalling OA-ness, how you could do it too.

These are the full slides (although slide show does not seem to like our Open Office document so much):

And a crowdsourced recording of the session:

We missed half of lunch with the queue of questions extending past our sessions, which was fabulous to see such interest. There is a particular affinity we found with the Content Mine initiative, which wants to programmatically extract facts from papers. Since we are finding and uploading mine-able papers, you could imagine some sort of suggestion system which says to an editor “you cited [fact x] from this paper, do you also want to cite [extracted facts] in the Wikipedia article too?”. Let’s work to make that system a fact in itself.

Wiki-Class Set-up Guide and Exploration

Best viewed with IPython Notebook Viewer


Wiki-Class Set-up Guide and Exploration

Wiki-Class is python package that can determine the quality of a Wikipedia page, using machine learning. It is the open-sourcing of the Random Forest algorithm used by SuggestBot. SuggestBot is an opt-in recommender to Wikipedia editors, offering pages that need work which look like pages they’ve worked on before. Similarly, with this package, you get a function that accepts a string of wikitext, and returns a Wikipedia Class (‘Stub’, ‘C-Class’, ‘Featured Article’, etc.). Wiki-class is currently in alpha according to its packager and developer [@halfak](https://twitter.com/halfak), and although I had to make a few patches to get some examples to work, it’s ready to start classifying your wikitext.


  1. Setting it up on Ubuntu.
  2. Testing the batteries-included model.
  3. Using the output by introducing a closeness measure.
  4. Testing making our own model.


At first you may be frustrated to learn that Wiki-Class is Python 3 only. You’ll not be able to mix it with pywikibot, which is Python 2.7 only, and that can also mean upgrading some of your other tools. However just try to recall these update gripes next time you encounter a UnicodeError in Python 2.x; and then be thankful to Halfak for making us give Python 3 a try. I outline getting the environment running in Ubuntu 14.04 here.

Firstly, if you want to use the Ipython notebook with python3 you can do so with apt-get. And while we’re at it, for convenince we’ll also install another version of pip for Python 3.

In [95]:
!sudo apt-get install ipython3-notebook python3-pip
[sudo] password for notconfusing: 

Some requirements of Wiki-class, including sklearn, and nltk, which are a pain with Python 3 since they haven’t been properly packaged for it yet. So these you’ll have to get from source:

In [1]:
!pip3 install git+https://github.com/scikit-learn/scikit-learn.git
!pip3 install git+https://github.com/nltk/nltk/#

Making some random pages for a test dataset

We’ll need to get some Wikitext, with associated classifications, to start testing. I elected to make a random datasetin pywikibot, which as already stated is Python 2.7 only, and thus needs to be in a separate notebook, you can view it on the nbviewer still. Its output is a file test_class_data.json (github link of the bzip) which is just a dictionary associating qualities and page-texts.

Warning, this dataset has some examples that can cause a ZeroDivisonError because some of these pages have 0 non-mark-up text. I wrote this patch which fixes this issue.

Testing the Pre-built Model

In [3]:
import json
import pandas as pd
from wikiclass.models import RFTextModel
/usr/local/lib/python3.4/dist-packages/pandas/io/excel.py:626: UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.
  .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))

Each model is stored in a .model file. A default one is included in the github repo.

In []:
!wget https://github.com/halfak/Wiki-Class/blob/master/models/enwiki.rf_text.model?raw=true
In [35]:
!mv enwiki.rf_text.model\?raw\=true enwiki.rf_text.model

Now we load the model.

In [4]:
model = RFTextModel.from_file(open("enwiki.rf_text.model",'rb'))
In [5]:
classed_items = json.load(open('test_class_data.json','r'))
print(sum([len(l) for l in classed_items.values()]))

The Wiki-Class-provided model only deals with ‘Stub’, ‘Start’, ‘B’, ‘C’, ‘Good Article’, and ‘Featured Article’ classifications. It does not include not ‘List’, ‘Featured List’, or ‘Disambig’ class pages. So we have to sort out the standard classes out of our 38,000 test articles.

In [6]:
standards = {actual: text for actual, text in classed_items.items() if actual in ['Stub', 'Start', 'C', 'B', 'GA', 'FA'] }
In [5]:
print(sum([len(l) for l in standards.values()]))

Now we iterate over our 36,000 standard-class pages, and put their Wiki-Class assessments into a DataFrame.

In [6]:
accuracy_df = pd.DataFrame(index=classed_items.keys(), columns=['actual','correct', 'model_prob', 'actual_prob'])
for actual, text_list in standards.items():
    #see if actual is even here, otherwise no fair comparison
        for text in text_list:
                assessment, probabilities = model.classify(text)
            except ZeroDivisionError:
                #print(actual, text)
            accuracy_df = accuracy_df.append({'actual': actual,
                                              'correct':int(assessment == actual),
                                              'model_prob': probabilities[assessment],
                                              'actual_prob': probabilities[actual]}, ignore_index=True)

What you see here is that the output of an assessment is really two things. The ‘assessment’ which is simply the ‘class’ which the algorithm predicts best, but secondly a dictionary of probablities of how likely the text is to belong to each class.

In our DataFrame we record four data. The ‘actual’ class as Wikipedia classes it; whether the actual class matches the model prediction. The probabilty (read: “confidence”) of the model prediction. And lastly the probability of the actual class. Note in the “correct” case model_prob and actual_prob are the same.

In [7]:
df  = accuracy_df.dropna(how='all')
actual correct model_prob actual_prob
18 Start 0 0.4 0.0
19 Start 1 0.8 0.8
20 Start 0 0.4 0.0
21 Start 0 1.0 0.0
22 Start 1 0.7 0.7

If we look at the correct mean averages we should hopefully see something above 1/6th, which would be the performance of just guessing. Which we do.

In [8]:
groups = df.groupby(by='actual')
B         0.247391
C         0.278138
FA        0.854167
GA        0.444444
Start     0.387334
Stub      0.698394
Name: correct, dtype: float64

See how “close” predications are if they are not correct.

Now we hack on the output. The Random Forest is really just binning text into difference classes, it doesn’t know that some of the classes are closer to each other than others. Therefore we define a distance metric on the Standard Wiki classes. I call this order the “Classic Order” To get an intuition, consider this example. If an article is a Good Aritcle and the model prediction is also Good Article then it is off by 0; if the model prediction is Featured Article it is off off by 1; if the model prediction is Start then it was off by 3.

In [7]:
classic_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
enum_classic = enumerate(classic_order)

for enum, classic in dict(enum_classic).items():
    print(enum, classic)
0 Stub
1 Start
2 C
3 B
4 GA
5 FA

Now we are going to iterate over the same dataset as above, but instead of recording “correctness”, we record the closesness in a DataFrame.

In [8]:
classic_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
classic_dict = dict(zip(classic_order, range(len(classic_order))))

off_by_df = pd.DataFrame(index=classed_items.keys(), columns=['actual','off_by'])

for classic in classic_order:
    for text in standards[classic]:
                assessment, probabilities = model.classify(text)
            except ZeroDivisionError:
                #print(actual, text)
            off_by_df = off_by_df.append({'actual': classic,
                                              'off_by':abs(classic_dict[assessment] - classic_dict[classic])}, ignore_index=True)

So it should look something like this as a table

In [9]:
off_by  = off_by_df.dropna(how='all')
actual off_by
18 Stub 2
19 Stub 1
20 Stub 0
21 Stub 0
22 Stub 0

And as a chart.

In [10]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['text']
`%pylab --no-import-all` prevents importing * from pylab and numpy

We can see that the middle classes are less easy to predict where as the ends are easier. This would corroborate our expectations. Since the the quality sprectrum bleed past these rather arbitrary cut-off points,ore of the quality specturm would lie in these intervals, and so its easier to bin them.

In [11]:
ax = off_by.groupby(by='actual',sort=False).mean().plot(title='Prediction Closeness by Quality Class', kind='bar', legend=False)
ax.set_ylabel('''Prediction Closeness (lower is more accurate)''')
ax.set_xlabel('''Quality Class''')
<matplotlib.text.Text at 0x7fc089810550>

Making a model

Now we test the model-making feature. We will use our dataset of ‘standards’ from above, using a random 80% for training and 20% for testing.

In [27]:
from wikiclass.models import RFTextModel
from wikiclass import assessments

Divvyig up our data into two lists.

In [28]:
import random

train_set = list()
test_set = list()
for actual, text_list in standards.items():
    for text in text_list:
        if random.randint(0,9) >= 8:
            test_set.append( (text, actual) )
            train_set.append( (text, actual) )


And the next step is quite simple, we just click a button supplying our train_set list, and test by supplying our test_set list. Also the package conveniently supplies a saving function for us to store our model for later use.

In [29]:
# Train a model
model = RFTextModel.train(

# Run the test set & print the results
results = model.test(test_set)

# Write the model to disk for reuse.
model.to_file(open("36K_random_enwiki.rf_text.model", "wb"))
pred assessment    B    C  FA  GA  Start  Stub
real assessment                               
B                130   29   1   5    105    40
C                 34  112   0   2    151    33
FA                 7    3   4   0      1     0
GA                 8    8   0  11      9     1
Start             80   87   0   2   1420   525
Stub              40   32   0   0    547  3973

Now to look at accuracy, we norm the DataFrame row-wise.

In [30]:
norm_results = results.apply(lambda col: col / col.sum(), axis=1)
pred assessment B C FA GA Start Stub
real assessment
B 0.419355 0.093548 0.003226 0.016129 0.338710 0.129032
C 0.102410 0.337349 0.000000 0.006024 0.454819 0.099398
FA 0.466667 0.200000 0.266667 0.000000 0.066667 0.000000
GA 0.216216 0.216216 0.000000 0.297297 0.243243 0.027027
Start 0.037843 0.041154 0.000000 0.000946 0.671712 0.248344
Stub 0.008711 0.006969 0.000000 0.000000 0.119120 0.865200

And finally we can view the peformance by class, which intriguingly seems to be better than what we got with the batteries-included model.

In [35]:
for c in classic_order:
    print(c, norm_results.loc[c][c])
Stub 0.865200348432
Start 0.671712393567
C 0.33734939759
B 0.41935483871
GA 0.297297297297
FA 0.266666666667

We can see that, having a large number of stubs to train on really gives us a high precision in classifying them.

So there you have it – a brief playing around with Wiki-Class, an easy way to get rough quality estimates out of your data. If you extend any more examples of using this class, I’d be intrigued to see and collaborate on them.


In []:

What Part of “School” Don’t You Understand?

I received an apologetic email from HackerSchool an hour ago, that was sorry to tell me they couldn’t admit me this fall – quizzically I was not gutted. HackerSchool is part of the wave of “Hacker Education,” where you exchange something with a company for programming education. HackerSchool differentiates in that you don’t pay them upfront, or necessarily at all – they just want a cut of a potential recruiting bonus when they pawn you to another company. They also have good perspectives on lightweight social rules and gender equality which piqued me.  Still, let us not mince, this is private education. A more dedicated Hacker might call it a co-option of DIY, gift-economy culture.

Although this might seem a bitter and fruitless retaliation in response to a rejection letter, that is only the first lily-pad. In fact, there is something more that made HackerSchool particularly attractive to me, which only became apparent in retrospect. As I wrote in my application (which is copied in its entirety below), and previously about dogs, a central conundrum for me is not knowing how to work for myself. I can work to impress authorities, appear clever for narcissistic purposes, or for fear of failure – but not because I want to.

HackerSchool’s “everyone determines their own lesson plan” philosophy could be a vital stepping stone to a dreamed-of autodidacticism. On the face of it, going to HackerSchool even looks like genuine autodidacticism. But closer inspection would reveal that you have an authority (the HackerSchool institution) that instructs you to teach yourself. The outer loop, the most meta-level, is still a deference to a force that isn’t your own. It’s virtualizing self-ownership, which is really just “bluepilling” yourself.

It's virtualizing self-ownership, which is really just
It’s virtualizing self-ownership, which is really just “bluepilling” yourself.


This conflict arose during my 14 minute interview with  the organizers, in the question of “how would I learn,” my favoured topics? “I suppose, I would use textbook as in College,” I replied without confidence, and later amended to “but typically it’s been project-needs plus stackoverflow.”  In both cases I now see I pointed to historic examples where the main motivator was either a professor or boss. Unsurprisingly this was a lacking answer to both myself, and an interviewer not looking to be my professor or boss.

Looking for positive cases of escaping this servitude, there is obviously one classical logic. Accomplishment-desire drives non-authority-pleasing mechanisms of work. But what if we allow that natural curiosity is not the only way to exit the teach-yourself-to-teach-yourself paradox. What could be alternatives? We could use as a starting point the goal vs. process attitude dichotomy. In this framework the ravenous prodigy sits neatly on the “goal” side. And on the other side?

There isn’t a prominent model to represent the unspurred, successful process-worshipper. The best exmaples I can offer are probably something like Aaron Schwartz, Grigori Perelman, or a stereotypical monk. Having such a dearth of role models is probably because process-oriented people aren’t highly lauded in our prize-counting society, and are thus non-notable. This is a dead-end I feel I’ve been running into frequently.

The conclusive feeling here is not directed, but is still a redoubling of effort. It’s a large, and still partially free internet out there. There’s Open Access research to read and write, and Open Source code to execute and develop. Even without the promise of coming to an epiphany of how not to get depressed about the fact that I do it alone in my bedroom, that unlit corridor still calls to me as the one with light at the end.

Should it help anyone fulfil their dreams, and for the sake of radical transparency, this was my HackerSchool Application.

HackerSchool Application


Please include any that you have: GitHub, LinkedIn, personal web site, etc.



https://www.linkedin.com/pub/maximilian-klein/4/b1b/63 Any tips for updating?

http://notconfusing.com (trying to fix width issue on category subpages may need to switch thee)

Code CracklePop

Write a program that prints out the numbers 1 to 100 (inclusive). If the number is divisible by 3, print Crackle instead of the number. If it’s divisible by 5, print Pop. If it’s divisible by both 3 and 5, print CracklePop. You can use any language.


rice = [3,5]

crispies = ['Crackle','Pop']

rice_crispies = dict(zip(rice, crispies))

for i in range(101):


    for flake in rice:

        if i % flake == 0:

            print(rice_crispies[flake], sep='', end='')

            print('', end='\n')


Please link to a program you’ve written from scratch.

You can use something likeGitHub’s gist to host your code. It doesn’t need to be long, but it should be something you’ve written yourself, and not using a framework (e.g., Rails). If you don’t have anything to submit, code something small, like a game of tic-tac-toe.


This tutorial translates an Economic algorithm into Python. In short, it does some matrix-calculations, statistical analysis, and some plotting. Its most advanced language-feature is a python “generator.”


What is the most fascinating thing you’ve learned in the past month?

This doesn’t have to be about programming.


We all know a averaged crowd of fair-goers can guess the weight of a heifer more accurately than any of the individual simpletons among them. Science shows the principle extends to marbles and encyclopedias as well. But what I picked up this month at the Network Science Conference ‘14, was that it can be applied to stock trading too. Diversification strategies work – but they can also be diversified. A team of researchers explained a technique that they simulated. If you followed x traders, mimicking exactly the trades they perform, but with 1/x of your money, then for sufficiently high x, the return is higher than any of the individuals.


The network science bit comes in because you don’t want anyone you follow to be following each other. For the highest return on investment, those who you follow should have “no common ancestors,” in network parlance.


More so than stock trading, the “wisdom of the crowds” theory appeals to me. Trying to make clever stock decisions is a huge industry, and this intuitive simple mechanism can compete with more complex ones. What’s fascinating here to me is how theories can unexpectedly translate between domains.

What do you want to be doing in two years?


Two years from now I would like be swimming through the gooey centre of a large research project at a think tank or in research and development. Stemming from my previous employment at OCLC Research (a library think tank), I enjoy the freedom of blue-sky thinking. Therefore the employers that have a large enough budget for pure research (Microsoft Research is a good example of this) are the competitive waters that I want to compete in. Having such lofty dreams are never regrettable in my experience because there are always failsafes. In this case one can always sell oneself as a Data Analyst for business intelligence.


To enjoy any future work however it would be crucial for me to be in a team of stellar collaborators. My personal adage (which I stole from a guy that works in a copy shop) is “life is the water cool, the water cooler is life.” Being around people ignites my mind (even at the copy shop), and I want to continue fuelling that fire. I will continue to invite uncomfortable differences in perspective. Therefore in two years I want to be in a team that values learning over goals. Goals inevitably follow learning – but not vice-versa.

Why do you want to do Hacker School?

I see Hacker School as the centre part of a venn diagram of my desires which are (1) learning self-directedly (2) being part of a supporting group, and (3) boosting employment opportunity after.


Last year when I went to my boss and asked her to crack the whip on me harder, my own actions perplexed me. I quit my job to attack the problem of relying on authority to motivate my work. But next came the paradox: how can one self-direct one’s self to autodidactically become self-directed? Recursion without a base case.


From a pragmatic perspective I still go to my local hackerspace because I enjoy what could be termed as “co-learning.” The social environment drives me. Being conscious of your impression on others, can psychologically push you to work. It’s not self-actualizing alone in your bedroom, but it’s effective.


Hacker school seems like a realpolitik compromise between bootstrapping self-ownership, and well-proved social dynamics. Given that Hacker School can also help with the personally-dreaded task of a job-search after, I see a trifecta being won.


What would you like to work on at Hacker School?

E.g., things you want to learn or understand better, projects you want to build or contribute to, etc.


While there are a few pet-projects that jump to mind, none are as important as the process of the work I might do. Rather than pronounce any work in detail, I would describe my desires declaratively. There are two main criteria. Firstly, like a carrot just beyond horse-mouth’s reach, I want to find a project that is harder than I expect, simply to level-up. Secondly, is to overcome the folly of the lone-inventor myth. To horizontally work with a partner is as important as being the bringer of the techno-revolution. So the thing I would like to work on is a new idea I would receive while I’m at Hacker School.


That being said, in absence of any external input, a few of the topics I want to understand better are machine learning, genetic algorithms, and pattern recognition. These corners of computer science are somehow just cool. Pursuing them, since they are substantially complex seems commensurate with Hacker School motto of “get dramatically better.”


Also I want to be able to make my phone turn off silent mode by sending it a secret text for those hidden-under-couch situations.

Programming background

This information will not disqualify your application. We use it to better get to know our applicants and where they currently are. If you’re worried that you won’t fit into Hacker School, you can read aboutsome of our alumni.

Describe your programming background in a few sentences.

2006. Failed Java in Community College.

2009. Discover I enjoyed programming Turing Machines on paper in “Computability Theory” in my Pure Math major.

2010. Enroll in – and revel in – the purity of the Berkeley/MIT Scheme tradition.

2011. Fail Java again. Tech career funeral and wake.

2012. Phoenixed with Python + Stackoverflow, to write Wikipedia bots.

2013. Welcome to the FOSS movement. Linux and git start unpacking in my brain.

2014. Hacker School


Have you worked professionally as a programmer?

If so, please describe your experience.


Working in programming and working hard at reinventing the idea of a “professional” programmer have been the last three years of my life. When I was “Wikipedian-in-Residence” I turned my job into programming by convincing management of my proposed bot-writing projects. In my own business I’ve won contracts to deliver reports that were the result of custom programs. So although I’ve never worked as a typical professional programmer, I like my life to be about delivering code for pay.


Do you have a Computer Science degree or are you seeking one?


I have a Bachelor’s degree in Mathematics from University of California Berkeley, and have applied and furthered my Computer Science knowledge outside of academia. In the far future I have considered enrolling in a  Master’s or PhD program. My draw towards a heavily mathematical emphasis looms. From my work with Wikipedia, a more human and social element has nestled in my head. Therefore it’s possible that my interests would converge in Computer Science degree.

Logicomix, page 162.

Prerequisite-free Set Theory – Just The Intuition

Logicomix, page 162.
Logicomix, page 162.

My favourite Hackerspace Sudo Room  is very close to  Bay Area Public School,  whose concept of a anti-capitalist University intrigues me very much. In chatting  about their plans for Math education, they expounded on the need for a primer to Set Theory, as they had been learning the Philosophy of Alain Badiou, who utilizes those foundations. Their request was for softer, more intuitive introduction. And just a short 18 months after that casual chat, this last, Saturday June 14th 2014, I held that public education, and it went brilliantly. 2 very curious mind showed up and we had fun reading the comic example aloud. The comic we used as a launching point is Logicomix: An Epic Search for Truth.

Continue reading

See how the Method of Reflections evolves as a recursive process.

Method of Reflections: Explained and Exampled in Python

The introduction of post is mirrored here, but the full tutorial is on IPython Notebook Viewer.

Method of Reflections Explained and Exampled in Python


See how the Method of Reflections evolves as a recursive process.
See how the Method of Reflections evolves as a recursive process.

The Method of Reflection (MOR) is a algorithm first coming out of macroeconomics, that ranks nodes in a bi-partite network. This notebook should hopefully help you implement the method of reflection in python. To be precise, it is the modified algorithm that is proposed by Caldarelli et al., which solves some problems with the original Hidalgo-Hausmann (HH) algorithm doi:10.1073/pnas.0900943106. The main problem with (HH) is that all values converge to a single fixed point after sufficiently many iterations. The Caldarelli version solves this by adding a new term to the recursive equation – what they call a biased random walker (function G). doi: 10.1371/journal.pone.0047278 . I hadn’t seen any open-source implementations of this algorithm, so I thought I’d share my naïve approach.

Read on at http://nbviewer.ipython.org/github/notconfusing/wiki_econ_capability/blob/master/Method%20of%20Reflections%20Explained%20and%20Exampled.ipynb

Continue reading

Morten’s Rule of Airports

This is Morten’s Rule of Airports, it’s history, and some of its benefits. The rule states:

If [the length of your layover] minus [the time it would take to comfortably get to the city centre and back] is greater than or equal to one hour, then you should exit the airport.

Or for those of you who read nerd:



During an expedition to see my friend Morten in British Colombia, in which I was struggling with travel-stress, I was struck by a story of enlightened travelling genius. Morten told of a tale where a flight of his was delayed at Charles de Gaulle Airport in Paris by several hours. It was an inconvenience because he was missing appointments for work and with friends in Denmark. He was getting stressed when he overheard a couple that were vocally calculating the time it would take to get to the city centre and back. Even with reasonable train delay buffers the journey would still leave them with 1 hour and 5 minutes in Paris. While it seemed a bit ludicrous to go for just 1 hour and 5 minutes, there wasn’t much else to do, so he boarded the shuttle, and set off mostly aimlessly. In his 65 minutes, he bought a croissant and a coffee and sat and pondered. A state of relaxation and bliss came over him, he relates. It was just atmosphere soaking, but it was real – especially compared to the departure lounge. On returning to the airport with the planned buffer in tact, he wrote to all his missed appointments that he was really sorry to miss their meetings. To his friend he recounted his hour of cosmically-displaced croissant munching and a restored inner peace.

Impressed very much by this story at the time, I’ve decided to honour it. Therefore I have abstracted its principle and named it after its originator. You can help me to codify this rule by implementing it in your life.

I walked this talk when I last stopped over in London. Landing at 8am, and taking off again at 2pm, I counted an hour’s travel in each direction and the need to be two hours early.  I was still left with 14-8-1-1-2 = 2 hours of freedom. At first I expected to do nothing besides my normal coffee and journalling  ritual. Yet, stepping out of Victoria station and on my way to hunting down a Costa’s, I encountered a City Cycle Hire. My plans changed in that instant, and after the nuisance of accepting all liability for my own dangerous cycling behaviour, out clicked a 3-speed indestructible 2-wheeler. Then with a buzzing smile across my face, I intrepidly raced to Buckingham Palace, Nelson’s Column, St. James Park,  and the Houses of Parliament with the goal of trying to get a fast-forward visual-effect through speed. Later I topped-up a SIM called an old friend for 13 mins 36 seconds,  and guzzled a Strongbow at 11am opening time. It was a brilliant interlude, providing adventure, exercise, and chemical euphoria in a condensed movie-long wander. Far better than purposelessly zooming in and out on the display cameras at the unbranded electronics shop (which I still did for 20 minutes anyway, but 60). I hope you are as convinced as I was to get out and enjoy the only-marginally but still better activities that are available in the open-air tourist traps.

In fact I had great joy at playing the fool as I asked a someone to take a picture of my luggaged self, and received this appalling shot.


It did however this piece of poetry:

The picture was cropped
by the tourist-photographer.
The subject was photographed
by Nelson’s trunk.

Profiles of Inspiring Wikimedians I Met at Wikiconference USA 2014

Wikiconference USA 2014, in New York, just finished, and more than usual this conference instilled in me a lot of motivating social energy. Yes, I did present there, twice, on “Answering Big Questions With Wikidata“, and “Signalling Open Access References,” but more so than usual I enjoyed attending other presentations. On reflecting why that was, I came to realize it was the earnest authentic effort of other Wikimedians, that shone so brightly. These are some of the more inspiring characters from the conference, but by no means a complete list.

Sumana Harihareswara

Sumana gave the opening keynote wherein she talked about implicit versus explicit exclusion. To introduce the subject she told of her positive experience at Hacker School which does actively exclude some people (there’s an application process), but as a result makes a more intentionally inviting space. That is because only inviting, inclusive individuals are selected for Hacker School. As she related this to Wikipedia, the shortcomings of our emphasis on liberty highlighted, perhaps it doesn’t ensure a safe learning space. A key quote that summed this was “in the Wikimedia community, since we don’t exclude anyone explicitly, we exclude others implicitly [sic].” Strong free speech defence is not muting some overbearing voices. A full transcript of the talk is available.

I particularly became aware of Hacker School’s “no well-actually’s” rule that Sumana presented. Many times during the conference, when someone was doing something in Python that had a technical side-note I wanted to slot into the conversation, but being newly aware of how this is disruptive, I simply allowed the real learning to continue. Sumana very much practiced this too, as I witnessed when someone came up to her to her to talk of her discomfort about someone who was wearing Google Glass at the conference. She without pause jumped to help, offering to go with the privacy advocate to find a conference organizer without any judgement on the camera-controversy itself.

Sumana also gave an impromptu Gender Diversity training, which came from Ada initiative. Actually this was offered twice and I attended both sessions (and it was my third time since I’d watched it online). Sumana’s rapid-fire style resonates with my personality and preferred learning style very well.  This allows me to really synch-up with the lesson and download the content with high-mental-bandwidth. In general Sumana is an over-clocked, but liquid-cooled processor, which is brilliant if you are too, and have fibre-optic connection.

In Zürich we were going to work on a python project together because we were both talking about wanting to pair-program python on twitter. Then we realized we didn’t want to work on the same project. I was really impressed with her straightforward, unpretentious communication when she said “It doesn’t seem like want to work on the same project – so perhaps another time.” The combination of directness and openness is liberating. Often we see one without the other, but having both is a fantastic combination.

And she is the mentor of:

Frances Hocutt

Frances gave a walkthrough and workshop on the MediaWiki API. I actually interact with the API a lot, but never directly only through pywikibot, so I was much enlightened by this lesson. In fact Frances explained with great care and deliberateness, from step zero, what is an API, all the way to the specifics of the MediaWiki API and how to use a client library. Frances’ teaching style is methodical. The pace is never frantic, taking time to get every word right, never needing to allow herself extra time with “er” or “um”. Learning from Frances is like having an immaculate syllogism patiently unfurl in front of you.

She also did the brave thing of giving a live demo of mwclient, starting from pip installation. Which was great to learn because I am only familiar with its not-quite-competitor pywikibot. So she both didn’t assume any technical knowledge, but didn’t leave experienced programmers bored, which is a hard balance to strike. This is her blog about her presentation including links to her slides.

Frances also taught the same Gender Diversity Training aimed at cis-men, which I attended. It was in this reprise that I most caught the proverbial advice –  “to follow your discomfort.”

Finally I’d like to credit her Chemistry knowledge and quick wit. In my previous blog post about sex ratios, I mentioned I’d found an occurrence of “sodium” for a sex. Frances quickly Sherlock Holmes’d that this was likely because someone had tried to enter not applicable – “na”, and probably received the auto-suggest chemical element.

Joelle fleurantin

Joelle reflecting. Attribution: http://fleurantin.cc/
Joelle reflecting. Attribution

At the conference Joelle gave a lightning talk that I enjoyed about her involvement in improving the Mozilla wiki, at which she has an internship. One day, she said, she became curious about the wiki’s usage statistics, but could not find anything more than minimal information that was contained in maintenance reports. So she has started building some scripts to analyse and visualize the Mozilla wiki’s usage.

Later in her lightning talk she also discussed her own autodidactic learning techniques, where she told of being a big recorded-conference-video watcher. Joelle has a particular penchant for linux.com, she shared. Therefore her being part of Gnome’s Outreach Program for Women should come as no surprise. As part of her Gnome involvement, Joelle fuzzes 0MQ, a stress-test debugging techinque as she patiently explained to me.

Over a beer in Brooklyn later on, she was coaxed to shed some immodesty and recap the tech-art piece that she’d made. It is a interactive installation where one wears headphones listening to a monologue of a woman talking about her inner thoughts and as you approach a video screen a proximity sensor tracks you. The closer you get to the screen, the video changes to reveal increasingly more intimate footage.

Megan Wacha

Megan on the Left. Attribution https://twitter.com/Museocat/status/472387217298825216/photo/1
Megan on the Left. Attribution

Megan is the Research and Instruction Librarian for the Performing Arts at Barnard College of Columbia University. Her presentation at the conference was about the multiplicities of roles for Librarians in Wikipedia. Regretfully because of scheduling I couldn’t attend it. She however attended mine and Wrought’s Signalling Open Access talk, and amazed from the Q and A. In a debate about whether it is it overcomplicating to import Open Access articles to Wikisource, as there may be corrections or retractions published, she noticed the more general problem. This was the first time I heard someone say “I’m going to bring this up with MLA.” Her reasonable position is that “we should really be citing the used-source and not the original publication.” I didn’t even know you could take issue with MLA.

During a lightning talk about there not being enough video in Wikipedia, a list of high profile articles without videos were cited, “Racing, Soccer, Dance.” On the word dance, with a large hacker-confidence she leaned over to me and said “we’re going to fix that.” What an assertion,  and I believe it because of her other on-wiki work. Do you know the Ntozake Shange article? Well its existence is owed to the inclusion of particularly hard to find sources – which is her speciality.


Two last special mentions, that I didn’t get enough time to know well, but want to hat-tip.

Dorothy Howard

Is currently working at the Metropolitan Library Council, as Wikipedian in residence.
Endearingly to me, she promotes the Wikipedia Authority Control project, which is easy to enjoy since it aggrandizes the work with VIAFbot. But this is also part of a holistic effort of hers to be a sort of techno-evangelist for a lot of wiki-library projects, and anyone that is in that space knows its presence is hotting up.

Jennifer Baek

Has been involved in SFC for a long time, Wrought told me that he remembers meeting her in 2008 in Berkeley. Apparently since then she has not let up. She was the main conference organizer, and fire-putter-outer. When I was accidentally double-booked (to speak in two places at once) she coolly helped make sane the logistics. Thank you for making the conference happen.

The Virtuous Circle of Wikipeda: The Poster

It may seem like a small piece of work, but I wanted to commemorate this moment – my first poster. I never had the need to manufacture one. Today I presented it at NetSci (Network Science) 2014, and received many useful comments on the research. We found a few other that are, like ourselves, translating  the ‘method of reflections’ into new domains. The paper related to this poster is in review, but you can also access a preprint files on github.

On the art side I’d like to thank unluckylion, for encouraging me to make a bold statement. I think it paid off, and I’m only mildly guilty about the blatant copyvio of the Wikipedia logo. Although I’ll use that point to show the necessity for the new attribution logos.

Skeumorph anyone?
Skeumorph anyone?


Sneak Peek at Wikimedia’s New Bold, High Concept Iconography

Wikimedia’s User Experience team invited me and a few others into the office to be part of focus group concerning a proposed new iconography.

The are two proposed new design languages, and an icon or “mark” for each Wikimedia project.
A selfie with two of the fablous design team, May and YuFei.
Penchant for selfies. Myself (left) with two of the fabulous design team, May (centre) and Yufei (right).

With free pizza proffed, the UX team Jared Zimmerman, May Galloway, and Yufei Liu, (pictured right) launched right into the need for these new set of icons, or “marks” as they are calling them.

  • The current logos don’t scale to 16 pixels square, and don’t overlay well.
  • To distinguish links to wikimedia sites on non-wikimedia sites.
    • Other sites have “social media” icons which if a brand is big enough replace a text link. Think facebook “f”, or twitter’s “t”.
    • Also, there was an intriguing mock-up which displayed twitter having a special preview of a link to Wikipedia, much like they treat youtube links specially.
  • Attribution to wikimedia content is verbose and cumbersome, and could be wrapped into an iconic link.

I’m convinced. Just like there are “post to facebook” buttons polluting the internet, there may as well be “read on Wikipedia” icons to restore some balance to the universe. Even though it’s minor, the attribution point is also valid. When I want to attribute commons – like I do on other parts of this blog – well all that copypasta is half of my repetitive strain injury.

Before continuing to show you what these marks actually look like, allow me to appease the User Experience team by disclaiming these disclaimers about the designs you are about to see.

  1. Not replacements. The marks are not meant to be replacements for the current logos (don’t call these logos). They are in-addition-to what we already have, and for others to use when pointing links to or mentioning Wikmedia.
  2. Not final. The marks shown here are not final, they are open for community review and scrutiny. I trust them because they sat quietly as I bombasted how the Wikpedia mark looks like it’s from M*A*S*H.
  3. Not forced. The marks will not be forced on the community. There will be a Request for Comment, and the outcome of that RfC will decide the fate of this project. Wikimedia Foundation is not making anyone do anything.
Wikipedia. While the tangram looks like it should be sent back to the army's crate-stenciling department, the path is pure Ikea, self-explaining simplicity.
Wikipedia. While the tangram looks like it should be sent back to the army’s crate-stenciling department, the path is pure Ikea, self-explaining simplicity.
Commons. *shutter sounds* There was some quibbling that commons is not just photos, so a camera doesn't represent it well. But I don't think you can beat it for recognizability. Notice in the Path, the lower semi circle motif turns into hand adjusting the lens.
Commons. *shutter sounds* There was some quibbling that commons is not just photos, so a camera doesn’t represent it well. But I don’t think you can beat it for recognizability. Notice in the Path, the lower semi circle motif turns into hand adjusting the lens.

With that said we can proceed to analysing the design language, of which there are two.


The first of the two languages, which in these images are the upper row, is called “Tangram”. A tangram (oh look there’s a link to Wikipedia, which wordpress could render with a small mark next to it) is a Chinese toy or puzzle, that consists of rearranging certain primitive shapes. All the tangram marks can be made by rearranging four shapes (sadly not pictured here). The tangram series is more “metaphoric” to use the UX team’s words. Although the Wikipedia mark, still a “W”, is not metaphoric as a notable exception. It’s also the simpler of the two series. Often times making out the meaning is a bit more oblique, but easier to see once the meaning is pointed out, which I do in the captions.


Path, shown in the lower row, is the more complicated set of the two. The UX team still says that these will work at 16 by 16 px. They are described as having a sketchier feel, and were explained to preserve the circular nature that exists in the current logos. Path’s meanings are more literal, and thus easier to decipher at first glance, which Jared Zimmerman said, almost regrettably, will bias people to like them better.

I’m sure you have many comments because this is close to a bikeshedding sort of discussion, but that is good because the UX team want your feedback. So make sure you send it to the right people.

Now you may enjoy your sneak peek.

Wiktionary. Do you know what a catchword is? Well that is how we used to access dictionaries, and the UX team is not afraid of a little skeumorph.
Wiktionary. Do you know what a catchword is? Well that is how we used to access dictionaries, and the UX team is not afraid of a little skeumorph.
Wikivoyage. You'll kick yourself for not getting this tangram - it's the sun setting behind a mountain range over a wavy sea. I'm being serious.
Wikivoyage. You’ll kick yourself for not getting this tangram – it’s the sun setting behind a mountain range over a wavy sea. I’m being serious.
Wikiversity. Tangrams show people coming together (althoug in my experience that doesn't equate learning). The path riffs on the classic
Wikiversity. Tangrams show people coming together (althoug in my experience that doesn’t equate learning). The path riffs on the classic graduation cap.
Wikispecies. The official explanation of the tangram is that it is the silhouette of a twisted double helix. And the path 'fingerprint' is more than endearing.
Wikispecies. The official explanation of the tangram is that it is the silhouette of a twisted double helix. And the path ‘fingerprint’ is more than endearing.
Wikisource. The tangram big-stack-of-papers is a stroke of genius IMHO.
Wikisource. The tangram big-stack-of-papers is a stroke of genius IMHO.
Wikiquote. Both are loud and clear. The there was discussion as to whether the displayed quotation marks were international enough. As Germans, French, and shockingly even non-Europeans do it differently.
Wikiquote. Both are loud and clear. There was discussion as to whether the displayed quotation marks were international enough. As Germans, French, and shockingly even non-Europeans do it differently.
Wikinews. The tangram is supposed to a person reading the paper (see it now‽). The path is as obvious as it can get.
Wikinews. The tangram is supposed to a person reading the paper (see it now‽). The path is as obvious as it can get.
Wikidata. If you don't know, the currently logo says "Wiki" (in Morse code I believe). The tangram explanation here was to design something that said "input-output". The path tries to show that the data (lines) are connected, and could be circumscribed in a hexagon.
Wikidata. If you don’t know, the currently logo says “Wiki” (in Morse code I believe). The tangram explanation here was to design something that said “input-output”. The path tries to show that the data (lines) are connected, and could be circumscribed in a hexagon.
Wikibooks. Books must predate graphic design.
Wikibooks. Books must predate graphic design or something.
Meta. The tangram is suppose to give a talk bubble conversation feel, although it was also pointed out that it looks like two laptops interfacing through a mirror. I hope the Path is replaced because it is much to much "live chat" on godaddy or somesuch.
Meta. The tangram is suppose to give a talk bubble conversation feel, although it was also pointed out that it looks like two laptops interfacing through a mirror. I hope the Path is replaced because it is much to much “live chat” on godaddy or somesuch.
Mediawiki. The sunflower disappears, but the brackets remain. Curly brace fans have a lot to be happy about.
Mediawiki. The sunflower disappears, but the brackets remain. Curly brace fans have a lot to be happy about.
Labs. The tangram gives you a sort of walkie-talkie upload-download feel, which is inline. And of course, there was not a lot do you can do with a unicorn.
Labs. The tangram gives you a sort of walkie-talkie upload-download feel, which is appropriate. And of course, there was not a lot do you can do with a unicorn.
Incubator. The path was commented by a focus group member to look like an avocado, to which the UX team's only response was that avocados don't have circular yolks.
Incubator. The path was commented by a focus group member to look like an avocado, to which the UX team’s only response was that avocados don’t have circular yolks.
Fondation. There are many tangrams because the UX team said, perhaps there should not be just one mark for foundation, in its many roles.
Fondation. There are many tangrams because the UX team said, perhaps there should not be just one mark for foundation, in its many roles.

Sex Ratios in Wikidata Part III

For a better reading experience read this post in the IPython Notebook Viewer.



I recently got back from the Mediawiki Hackathon in Zürich, where I was once again energized and inspired by the Wikidata Dev Team and Community. In chatting, they reminded me of analyses I ran in March 2013, Sex Ratios and Wikidata Parts I and II, about the state of a controversial Wikidata Property – P21 a.k.a. “sex or gender”. They suggested it was about time to reinvestigate.

Since Part I and II a lot has happened: the property has been renamed (from “sex” to “sex or gender”), it’s database constraints have been changed (from 3 accepted values to 13), and of course Wikidata has continued to proliferate (now about 400 million triples).

Therefore a few questions are begged:

  1. What are the currently used values of ‘sex or gender’, and their ratios in each language?
  2. How does May 2014 data compare to a year ago?
  3. What are the most represented neither ‘male’ nor ‘female’ ‘sex or gender’s?
    1. And which languages use them most?
  4. Per used sex value, what are the average number of accompanying properties?

That last question is not like the others, but comes from exploring the new Wikidata Toolkit, a library for parsing Wikidata dumps. Trying to stretch the imagination of what can be done with Wikidata data is a new hobby of mine, and I am giving a talk about it at Wikiconference USA, titled “Answering Big Questions with Wikidata”. For now the Wikidata Toolkit is at version 0.1.0, which is still not entirely feature-complete, but works perfectly to extract complete, daily-fresh data. For my own convenience I subset the data in java and then json export (github link), allowing me to munge it in Python with the “Pandas” library, which is exactly what you see here.

The biggest change in my opinion is the broadening – (but still not broad enough in my opinion – of the “accepted values” of the property. Let’s see what wikidatians are using these days. Below are the English titles of the QIDs and how many different wikidata-linked-wikis are connected to an item utilising that value.

In [1]:
{english_label(qid): language_count for qid, language_count in used_sexes_count.iteritems()}

{u'Female': 89,
 u'female': 367,
 u'female animal': 55,
 u'genderqueer': 23,
 u'intersex': 51,
 u'kathoey': 10,
 u'male': 395,
 u'male animal': 66,
 u'man': 3,
 u'sodium': 1,
 u'transgender female': 63,
 u'transgender male': 24}

So without delving into the validity of the classifications used, which I’ll adress later, we see 12 classifications in active duty. Compare this to the begrudging trinary we had a year ago, or Facebook’s announcement to use about 50 classifcations. We can also see that there are two heavily used classifications – by the metric of number of wikis using them – called ‘male’ (395 wikis) and ‘female’ (367 wikis). Why the difference in the number of wikis? We must clarify what we mean by use. We are talking about a Wikipedia, or a Wikisource, or a Wikivoyage instance that has a article that is linked to a Wikidata item which has a P21 “sex or gender” property. So that means that there are 28 or more Wikimedia wikis which have an article which Wikidata claims is about a ‘male’, but have no articles about ‘female’s. But there are also a lot of tiny wikis out there, which might explain this discrepancy.

Let’s restrict our data set only to those Wikis which have 1,000 or more articles that are linked to a Wikidata item with a P21 property. There are 42 such wikis as of May 2014. Now fe plot, the ratios or composition of the values of this property in each of those 42 wikis.

In [2]:
Sex Ratios Wikidata May 2014

As is visually evident, only the ‘male’ and ‘female’ categories are large enough to appear in the plot (later on we investigate this numerically). Therefore the chart is ordered by the ‘female’ percentage which ranges from 8.83% – Slovenian Wikipedia, to 19.97% – Serbian Wikipedia. English Wikipedia, the largest Wikipedia by article count comes in at 14.21% which is in the lowest quarter. Wikimedia commons, the only non-Wikipedia represented here performs relatively well at 18.86%.

These percentages are still systematically low, and tell a story that we’ve long known about representational bias. But what about the momentum? What difference are our efforts at uncovering and addressing systemic bias producing?

Comparing May 2013 to March 2014

These next tables inspect how a language’s composition changed in the previous year. We consider all languages that had at least 1,000 P21 associated properties in both years. I’ll disclaim that the differences come from the a Wikipedia’s content changing, but also from Wikidata becoming more connected to those different wikis. It’s not possible at the moment to disentangle these two causes. Another complicating factor, that we will investigate later on, is the growth of the neither-male-nor-female entries which could account for this drop – but (spoiler) they don’t.

In each table there is the percentage female from May 2013, from March 2014, and the ‘change%’ that this represents, “year-over-year” (even though its about 14 months).

We sort by ‘change%’, and first look at the largest losses in ‘female’ percentage.

Top Losers

In [17]:
diffdf.sort(columns='change%', ascending=True)[['female_may2013','female_march2014','change%']].head(10)
female_may2013 female_march2014 change%
enwiki 0.1845 0.142132 -22.96
gawiki 0.1456 0.118133 -18.86
afwiki 0.1406 0.115850 -17.60
cswiki 0.1705 0.141063 -17.27
frwiki 0.1658 0.141045 -14.93
zhwiki 0.2062 0.178885 -13.25
itwiki 0.1667 0.144760 -13.16
hywiki 0.1633 0.141859 -13.13
ruwiki 0.1627 0.142226 -12.58
htwiki 0.0531 0.047382 -10.77

10 rows × 3 columns

English Wikipedia fell the most in the percentage of it’s P21 properties being marked ‘female’. The reprsentation of articles marked ‘female’, compared to all others, dropped by about 4% in absolute terms, which is -23% year-on-year.

Also of note, the Hatian Wikipedia, the previous worst at 5.3% slid to retain the dubious title at 4.7%.

What about the other end of the chart?

Top Winners

In [18]:
diffdf.sort(columns='change%', ascending=False)[['female_may2013','female_march2014','change%']].head(10)
female_may2013 female_march2014 change%
urwiki 0.1319 0.486671 268.97
ocwiki 0.1261 0.159599 26.57
mlwiki 0.1636 0.202758 23.93
bnwiki 0.1313 0.161183 22.76
mznwiki 0.1041 0.125305 20.37
arzwiki 0.2392 0.287158 20.05
ltwiki 0.1190 0.142340 19.61
arwiki 0.1293 0.153516 18.73
warwiki 0.1003 0.116598 16.25
tlwiki 0.2943 0.340477 15.69

10 rows × 3 columns

Urdu Wikipedia gained a massive 268% year-on-year increase in their ‘female’ ratio. Now 49% of their P21-tagged articles have label ‘female’. Does anyone closer to this community know if there were any tagging efforts?

Tagalog Wiki, previous best, continued to increase to 34% of their P21-tagged articles having label ‘female’.

There has been a lot of movement in sex-ratios of different languages. As stated earlier this is also due in some part to the maturing of wikidata clusters. Next year we will be able to see if there is a deceleration in these ratios moving.

We now move on to invesitage the second confounding factor, the increase in accepted “sex or gender” values.

Non ‘male’ or ‘female’ values.

As we saw earlier, there are now 12 properties that are being used to describe P21. In May 2013 the only non-male-female term was intersex, and P21 said that you should be one of male, female, or intersex. I was quite angered by this prescriptiveness, but with help from online discussions, and with thanks to the gendergap mailing list some of those policies have changed. This now is the “sex or gender” property, rather than just “sex” which I consider a mixed result – quite literally. I am pleased that there is one instance of this value that is “sodium”, because I support this prorpety allowing any value. To be clear, what an “accepted” value means, is that periodically a check is run on the database, and non-accepted values are compiled into a list for user-attention. So robot won’t fight you if you use an unaccepted value, but the fodder is there for a human combatant.

Two more of the new values mention “animal” because in Czech and Finnish, there are seperate words to describe sexes of non-human animals. And of course Wikipedia has articles on famuos animals too.

Some others like ‘Female’ (a 1933 film), and ‘man’ are probably due to human tagging errors.

Below are the wiki’s which have the highest number of non ‘male’ or ‘female’ values as represented by the new column at the end “non_MF%”. As you can see none of them exceed two-tenths of 1%. So the upper analysis of year-on-year-change could at most be influenced by error range of ±0.2%.

In [130]:
top_non_MF.sort('non_MF%', ascending=False)
wiki non_MF%
male animal yiwiki 0.181159
transgender female urwiki 0.092994
Female mgwiki 0.080321
genderqueer zh_min_nanwiki 0.066445
intersex ckbwiki 0.058754
transgender male arzwiki 0.042105
female animal hywiki 0.038812
kathoey thwiki 0.012922
man jawiki 0.001523
sodium eswiki 0.000990

10 rows × 2 columns

Intriguingly, Urdu which tops the charts in year-on-year ‘female’ increase also the leader in higest ratio of “transgender female” at nearly one-tenth of one percent, or about 1 in a 1000. This lends credence to the idea that some Urdu Wikipedians have been busy.

Accompanying Data Richness

Markus Krötzsch, instigator of the Wikidata Toolkit, on which this research rests, talks about the complexity of Wikidata data. Convincingly he discusses why full unstructured access to the data is important – creative queries. Both star and tree shaped queries should be possible and at the users discretion.

One less trivial query I wanted to cook up was – “on items with a P21 value what are the properties by per item, by P21 value?” Framed in English, are wikidata items about males data-richer?

In [44]:
sex_props_df[sex_props_df['item_count'] > 5]
item_count total_props props_per_item
transgender female 41 398 9.707317
intersex 8 88 11.000000
female animal 6 44 7.333333
male animal 55 385 7.000000
genderqueer 8 63 7.875000
female 122288 738962 6.042801
male 768646 4816357 6.266028

7 rows × 3 columns

The above chart displays the properties per item, for all P21 values that occur 5 or more times. The results are telling, on avereage items with the ‘male’ property have 6.27 properties, and those with ‘female’ 6.04. It’s also worth mentioning that ‘transgender female’ averages 9.71 properties per item.


Without becoming too political, the biases that exist in Wikipedia’s representation of the world are systemic, and appreciable. We can see from our year-on-year calculations that there is movement in this dataset – albeit not always for the better. The representation of non-male-female items I suspect is lower than what a sample from the wolrd would indicate, but I don’t have any statistical reference, and would welcome suggestions on datasets with which to compare. Lastly we showed that not only in representation, but also in attention given to each item, underepresnted ‘sex or genders’ are less semantically-described.

Questions and criticism greatly recieved,


Start of Supporting Code

In [1]:
import json
from collections import defaultdict
import pandas as pd
import pywikibot
import decimal
NOPLACES = decimal.Decimal(10) ** 0
TWOPLACES = decimal.Decimal(10) ** -2
%pylab inline
VERBOSE:pywiki:Starting 1 threads...

Populating the interactive namespace from numpy and matplotlib

In [20]:
norm_sex[sexdf['total']>1000].sort(columns='non_MF', ascending=False).head(10)
female animal intersex kathoey Female transgender female male animal male female transgender male genderqueer man sodium non_MF
zh_min_nanwiki 0.000000 0.000000 0.000000 0.000664 0.000000 0.000664 0.787375 0.210631 0.000000 0.000664 0 0 0.001993
yiwiki 0.000000 0.000000 0.000000 0.000000 0.000000 0.001812 0.897645 0.100543 0.000000 0.000000 0 0 0.001812
cywiki 0.000371 0.000000 0.000000 0.000186 0.000371 0.000000 0.820375 0.178326 0.000186 0.000186 0 0 0.001299
ckbwiki 0.000000 0.000588 0.000000 0.000000 0.000588 0.000000 0.893067 0.105758 0.000000 0.000000 0 0 0.001175
thwiki 0.000000 0.000000 0.000129 0.000129 0.000388 0.000388 0.788345 0.210492 0.000129 0.000000 0 0 0.001163
mswiki 0.000000 0.000000 0.000000 0.000223 0.000223 0.000446 0.802679 0.196205 0.000223 0.000000 0 0 0.001116
ruwikiquote 0.000000 0.000000 0.000000 0.000552 0.000000 0.000000 0.909492 0.089404 0.000000 0.000552 0 0 0.001104
mlwiki 0.000270 0.000270 0.000000 0.000270 0.000270 0.000000 0.796161 0.202758 0.000000 0.000000 0 0 0.001081
eowiki 0.000125 0.000187 0.000062 0.000062 0.000374 0.000249 0.851300 0.147640 0.000000 0.000000 0 0 0.001060
kowiki 0.000075 0.000075 0.000038 0.000038 0.000491 0.000113 0.801608 0.197373 0.000075 0.000113 0 0 0.001019

10 rows × 13 columns

In [3]:
jsonfile = open('lang_sex.json','r')
bigdict = json.load(jsonfile)
lang_sex = defaultdict(dict)
for keystring, count in bigdict.iteritems():
    lang, sex = keystring.split('--')
    lang_sex[lang][sex] = count
used_sexes = defaultdict(list)
for lang, sex_dict in lang_sex.iteritems():
    for sex in sex_dict.iterkeys():
used_sexes_count = {sex: len(lang_list) for sex, lang_list in used_sexes.iteritems()}
In [6]:
sexdf = pd.DataFrame.from_dict(lang_sex, orient='index')
sexdf = sexdf.fillna(value=0)
#sexdf.plot(kind='bar', stacked=True, figsize=(10,10))
Norm_sex is not "normal" sex, but rather the Sex-data normed into percentages.
norm_sex = sexdf.apply(lambda col: col / float(col.sum()), axis=1)
In [8]:
#Tranforming QIDs into English labels.
enwp = pywikibot.Site('en','wikipedia')
wikidata = enwp.data_repository()

def english_label(qid):
    page = pywikibot.ItemPage(wikidata, qid)
    data = page.get()
    return data['labels']['en']

sex_qs = [str(q) for q in norm_sex.columns]
sex_labels = [english_label(sex_q) for sex_q in sex_qs]

norm_sex.columns = sex_labels
VERBOSE:pywiki:Found 1 wikidata:wikidata processes running, including this one.

In [9]:
#norm_sex.index = [label.replace('wiki','') for label in norm_sex.index]
#comparing by total between two different dataframes requires 
#that norm_sex has not had any rows modified since it was created from sexdf
sexdf['total'] = sexdf.sum(axis=1)
fs1000 = norm_sex[sexdf['total']>10000].sort('female', ascending=True)
In [11]:
def show_by_lang_plot():
    fsplot = fs1000.plot(kind='bar', stacked=True, legend=True, figsize=(13,8), alpha=0.9, ylim=(0,1),
                         title= '''Comoposition of Wikidata Prorerty:P21 "Sex or Gender" by Language 
    (Languages with over 1,000 associated P21)''',

    plt.yticks(linspace(0, 1, num=11), [str(decimal.Decimal(x * 100).quantize(NOPLACES))+'%' for x in arange(0,1.1,0.1)])
    ticklocs, langs = plt.xticks()
    langstrs = [str(decimal.Decimal(norm_sex.loc[lang.get_text()]['female']* 100).quantize(TWOPLACES))+'%  '+ lang.get_text() for lang in langs]
    plt.xticks(ticklocs, langstrs)
    plt.xlabel('Language-Wiki percentage "female"')

For your edification, the full data, and not just the ‘female’ percentages.

In [15]:
female animal intersex kathoey Female transgender female male animal male female transgender male genderqueer man sodium
slwiki 0.000074 0.000000 0.000000 0.000074 0.000074 0.000074 0.911398 0.088307 0.000000 0.000000 0.000000 0.00000
lawiki 0.000060 0.000060 0.000000 0.000060 0.000180 0.000120 0.889302 0.110219 0.000000 0.000000 0.000000 0.00000
bewiki 0.000000 0.000000 0.000000 0.000099 0.000000 0.000099 0.876528 0.123273 0.000000 0.000000 0.000000 0.00000
cawiki 0.000026 0.000000 0.000000 0.000026 0.000103 0.000129 0.870905 0.128786 0.000026 0.000000 0.000000 0.00000
elwiki 0.000000 0.000000 0.000000 0.000068 0.000068 0.000068 0.869061 0.130734 0.000000 0.000000 0.000000 0.00000
euwiki 0.000080 0.000000 0.000000 0.000080 0.000080 0.000239 0.865686 0.133837 0.000000 0.000000 0.000000 0.00000
skwiki 0.000078 0.000000 0.000000 0.000078 0.000078 0.000078 0.864363 0.135245 0.000078 0.000000 0.000000 0.00000
frwiki 0.000020 0.000025 0.000000 0.000005 0.000107 0.000097 0.858680 0.141045 0.000005 0.000015 0.000000 0.00000
cswiki 0.000048 0.000000 0.000000 0.000024 0.000096 0.000096 0.858648 0.141063 0.000024 0.000000 0.000000 0.00000
enwiki 0.000009 0.000012 0.000002 0.000002 0.000069 0.000052 0.857699 0.142132 0.000007 0.000014 0.000002 0.00000
ruwiki 0.000038 0.000019 0.000010 0.000010 0.000077 0.000077 0.857515 0.142226 0.000019 0.000010 0.000000 0.00000
dawiki 0.000036 0.000073 0.000000 0.000036 0.000109 0.000073 0.856768 0.142904 0.000000 0.000000 0.000000 0.00000
ukwiki 0.000062 0.000031 0.000000 0.000031 0.000062 0.000031 0.855573 0.144178 0.000031 0.000000 0.000000 0.00000
dewiki 0.000013 0.000006 0.000003 0.000003 0.000034 0.000047 0.855591 0.144277 0.000013 0.000013 0.000000 0.00000
itwiki 0.000006 0.000028 0.000000 0.000006 0.000090 0.000107 0.854986 0.144760 0.000011 0.000006 0.000000 0.00000
eowiki 0.000125 0.000187 0.000062 0.000062 0.000374 0.000249 0.851300 0.147640 0.000000 0.000000 0.000000 0.00000
glwiki 0.000082 0.000082 0.000000 0.000082 0.000329 0.000164 0.851602 0.147658 0.000000 0.000000 0.000000 0.00000
etwiki 0.000079 0.000079 0.000000 0.000079 0.000079 0.000237 0.846676 0.152771 0.000000 0.000000 0.000000 0.00000
arwiki 0.000040 0.000040 0.000000 0.000040 0.000079 0.000079 0.846166 0.153516 0.000000 0.000040 0.000000 0.00000
idwiki 0.000208 0.000052 0.000000 0.000052 0.000104 0.000156 0.843220 0.156156 0.000052 0.000000 0.000000 0.00000
hrwiki 0.000078 0.000078 0.000000 0.000078 0.000078 0.000156 0.842670 0.156863 0.000000 0.000000 0.000000 0.00000
eswiki 0.000030 0.000040 0.000000 0.000010 0.000109 0.000079 0.841094 0.158589 0.000020 0.000020 0.000000 0.00001
ptwiki 0.000043 0.000043 0.000000 0.000014 0.000199 0.000114 0.840760 0.158785 0.000014 0.000028 0.000000 0.00000
bgwiki 0.000091 0.000045 0.000000 0.000045 0.000091 0.000181 0.839242 0.160305 0.000000 0.000000 0.000000 0.00000
huwiki 0.000120 0.000040 0.000000 0.000040 0.000200 0.000200 0.839014 0.160386 0.000000 0.000000 0.000000 0.00000
plwiki 0.000031 0.000021 0.000000 0.000010 0.000063 0.000094 0.839206 0.160575 0.000000 0.000000 0.000000 0.00000
nlwiki 0.000041 0.000027 0.000000 0.000014 0.000082 0.000123 0.838302 0.161343 0.000014 0.000041 0.000014 0.00000
hewiki 0.000085 0.000042 0.000000 0.000042 0.000297 0.000127 0.836914 0.162449 0.000000 0.000042 0.000000 0.00000
trwiki 0.000076 0.000038 0.000000 0.000038 0.000114 0.000191 0.833810 0.165656 0.000038 0.000038 0.000000 0.00000
fiwiki 0.000060 0.000060 0.000020 0.000020 0.000099 0.000079 0.824523 0.175100 0.000040 0.000000 0.000000 0.00000
jawiki 0.000030 0.000030 0.000015 0.000015 0.000167 0.000107 0.823550 0.176039 0.000000 0.000030 0.000015 0.00000
zhwiki 0.000087 0.000029 0.000029 0.000029 0.000261 0.000145 0.820476 0.178885 0.000029 0.000029 0.000000 0.00000
nowiki 0.000020 0.000020 0.000000 0.000020 0.000082 0.000082 0.819593 0.180183 0.000000 0.000000 0.000000 0.00000
shwiki 0.000073 0.000073 0.000000 0.000073 0.000367 0.000000 0.817768 0.181571 0.000000 0.000073 0.000000 0.00000
fawiki 0.000066 0.000033 0.000033 0.000033 0.000332 0.000033 0.816748 0.182721 0.000000 0.000000 0.000000 0.00000
rowiki 0.000096 0.000048 0.000000 0.000048 0.000048 0.000096 0.816821 0.182844 0.000000 0.000000 0.000000 0.00000
simplewiki 0.000051 0.000051 0.000000 0.000051 0.000306 0.000102 0.815151 0.184084 0.000051 0.000153 0.000000 0.00000
viwiki 0.000093 0.000093 0.000000 0.000093 0.000093 0.000186 0.813103 0.186247 0.000000 0.000093 0.000000 0.00000
svwiki 0.000042 0.000042 0.000000 0.000014 0.000111 0.000056 0.811433 0.188275 0.000014 0.000014 0.000000 0.00000
commonswiki 0.000045 0.000000 0.000000 0.000045 0.000089 0.000268 0.810849 0.188614 0.000045 0.000045 0.000000 0.00000
kowiki 0.000075 0.000075 0.000038 0.000038 0.000491 0.000113 0.801608 0.197373 0.000075 0.000113 0.000000 0.00000
srwiki 0.000074 0.000074 0.000000 0.000074 0.000074 0.000223 0.799718 0.199688 0.000000 0.000074 0.000000 0.00000

42 rows × 12 columns

In [16]:
maydf = pd.read_table('may2013.csv',sep=',', index_col=0)
maydf['female'] = maydf['perc'] / 100.0
diffdf = maydf.join(other=norm_sex,how='inner',lsuffix='_may2013', rsuffix='_march2014')
diffdf['change%'] = (diffdf['female_march2014'] - diffdf['female_may2013']) / diffdf['female_may2013']
diffdf['change%'] = diffdf['change%'].apply(lambda x: decimal.Decimal(x * 100).quantize(TWOPLACES) )
In [19]:
non_MF_cols = [col for col in norm_sex.columns if col not in ['male','female']]
norm_sex['non_MF'] = norm_sex[non_MF_cols].sum(axis=1)
In [128]:
top_non_MF_dict = dict()
for s in non_MF_cols:
    t = norm_sex[sexdf['total']>1000].sort(columns=s, ascending=False)[s].head(1)
    top_non_MF_dict[s] = {'wiki':t.index[0],'non_MF%':t[0]*100}
top_non_MF = pd.DataFrame.from_dict(data=top_non_MF_dict, orient='index')
In [43]:
jsonfile = open('sex_propcount.json','r')
sex_props_json = json.load(jsonfile)
sex_props = defaultdict(dict)
for keystring, count in sex_props_json.iteritems():
    sex, prop = keystring.split('_')
    sex_props[sex][prop] = count
sex_props_df = pd.DataFrame.from_dict(sex_props, orient='index')

sex_qs = [str(q) for q in sex_props_df.index]
sex_labels = [english_label(sex_q) for sex_q in sex_qs]

sex_props_df.columns = ['item_count', 'total_props']

sex_props_df.index = sex_labels

sex_props_df['props_per_item'] = sex_props_df['total_props'] / sex_props_df['item_count']
VERBOSE:pywiki:Found 1 wikidata:wikidata processes running, including this one.