Sex Ratios in Wikidata Part III

For a better reading experience read this post in the IPython Notebook Viewer.

 

Introduction

I recently got back from the Mediawiki Hackathon in Zürich, where I was once again energized and inspired by the Wikidata Dev Team and Community. In chatting, they reminded me of analyses I ran in March 2013, Sex Ratios and Wikidata Parts I and II, about the state of a controversial Wikidata Property – P21 a.k.a. “sex or gender”. They suggested it was about time to reinvestigate.

Since Part I and II a lot has happened: the property has been renamed (from “sex” to “sex or gender”), it’s database constraints have been changed (from 3 accepted values to 13), and of course Wikidata has continued to proliferate (now about 400 million triples).

Therefore a few questions are begged:

  1. What are the currently used values of ‘sex or gender’, and their ratios in each language?
  2. How does May 2014 data compare to a year ago?
  3. What are the most represented neither ‘male’ nor ‘female’ ‘sex or gender’s?
    1. And which languages use them most?
  4. Per used sex value, what are the average number of accompanying properties?

That last question is not like the others, but comes from exploring the new Wikidata Toolkit, a library for parsing Wikidata dumps. Trying to stretch the imagination of what can be done with Wikidata data is a new hobby of mine, and I am giving a talk about it at Wikiconference USA, titled “Answering Big Questions with Wikidata”. For now the Wikidata Toolkit is at version 0.1.0, which is still not entirely feature-complete, but works perfectly to extract complete, daily-fresh data. For my own convenience I subset the data in java and then json export (github link), allowing me to munge it in Python with the “Pandas” library, which is exactly what you see here.

The biggest change in my opinion is the broadening – (but still not broad enough in my opinion – of the “accepted values” of the property. Let’s see what wikidatians are using these days. Below are the English titles of the QIDs and how many different wikidata-linked-wikis are connected to an item utilising that value.

In [1]:
{english_label(qid): language_count for qid, language_count in used_sexes_count.iteritems()}

Out[1]:
{u'Female': 89,
 u'female': 367,
 u'female animal': 55,
 u'genderqueer': 23,
 u'intersex': 51,
 u'kathoey': 10,
 u'male': 395,
 u'male animal': 66,
 u'man': 3,
 u'sodium': 1,
 u'transgender female': 63,
 u'transgender male': 24}

So without delving into the validity of the classifications used, which I’ll adress later, we see 12 classifications in active duty. Compare this to the begrudging trinary we had a year ago, or Facebook’s announcement to use about 50 classifcations. We can also see that there are two heavily used classifications – by the metric of number of wikis using them – called ‘male’ (395 wikis) and ‘female’ (367 wikis). Why the difference in the number of wikis? We must clarify what we mean by use. We are talking about a Wikipedia, or a Wikisource, or a Wikivoyage instance that has a article that is linked to a Wikidata item which has a P21 “sex or gender” property. So that means that there are 28 or more Wikimedia wikis which have an article which Wikidata claims is about a ‘male’, but have no articles about ‘female’s. But there are also a lot of tiny wikis out there, which might explain this discrepancy.

Let’s restrict our data set only to those Wikis which have 1,000 or more articles that are linked to a Wikidata item with a P21 property. There are 42 such wikis as of May 2014. Now fe plot, the ratios or composition of the values of this property in each of those 42 wikis.

In [2]:
show_by_lang_plot()
Out[2]:
Sex Ratios Wikidata May 2014

As is visually evident, only the ‘male’ and ‘female’ categories are large enough to appear in the plot (later on we investigate this numerically). Therefore the chart is ordered by the ‘female’ percentage which ranges from 8.83% – Slovenian Wikipedia, to 19.97% – Serbian Wikipedia. English Wikipedia, the largest Wikipedia by article count comes in at 14.21% which is in the lowest quarter. Wikimedia commons, the only non-Wikipedia represented here performs relatively well at 18.86%.

These percentages are still systematically low, and tell a story that we’ve long known about representational bias. But what about the momentum? What difference are our efforts at uncovering and addressing systemic bias producing?

Comparing May 2013 to March 2014

These next tables inspect how a language’s composition changed in the previous year. We consider all languages that had at least 1,000 P21 associated properties in both years. I’ll disclaim that the differences come from the a Wikipedia’s content changing, but also from Wikidata becoming more connected to those different wikis. It’s not possible at the moment to disentangle these two causes. Another complicating factor, that we will investigate later on, is the growth of the neither-male-nor-female entries which could account for this drop – but (spoiler) they don’t.

In each table there is the percentage female from May 2013, from March 2014, and the ‘change%’ that this represents, “year-over-year” (even though its about 14 months).

We sort by ‘change%’, and first look at the largest losses in ‘female’ percentage.

Top Losers

In [17]:
diffdf.sort(columns='change%', ascending=True)[['female_may2013','female_march2014','change%']].head(10)
Out[17]:
female_may2013 female_march2014 change%
lang
enwiki 0.1845 0.142132 -22.96
gawiki 0.1456 0.118133 -18.86
afwiki 0.1406 0.115850 -17.60
cswiki 0.1705 0.141063 -17.27
frwiki 0.1658 0.141045 -14.93
zhwiki 0.2062 0.178885 -13.25
itwiki 0.1667 0.144760 -13.16
hywiki 0.1633 0.141859 -13.13
ruwiki 0.1627 0.142226 -12.58
htwiki 0.0531 0.047382 -10.77

10 rows × 3 columns

English Wikipedia fell the most in the percentage of it’s P21 properties being marked ‘female’. The reprsentation of articles marked ‘female’, compared to all others, dropped by about 4% in absolute terms, which is -23% year-on-year.

Also of note, the Hatian Wikipedia, the previous worst at 5.3% slid to retain the dubious title at 4.7%.

What about the other end of the chart?

Top Winners

In [18]:
diffdf.sort(columns='change%', ascending=False)[['female_may2013','female_march2014','change%']].head(10)
Out[18]:
female_may2013 female_march2014 change%
lang
urwiki 0.1319 0.486671 268.97
ocwiki 0.1261 0.159599 26.57
mlwiki 0.1636 0.202758 23.93
bnwiki 0.1313 0.161183 22.76
mznwiki 0.1041 0.125305 20.37
arzwiki 0.2392 0.287158 20.05
ltwiki 0.1190 0.142340 19.61
arwiki 0.1293 0.153516 18.73
warwiki 0.1003 0.116598 16.25
tlwiki 0.2943 0.340477 15.69

10 rows × 3 columns

Urdu Wikipedia gained a massive 268% year-on-year increase in their ‘female’ ratio. Now 49% of their P21-tagged articles have label ‘female’. Does anyone closer to this community know if there were any tagging efforts?

Tagalog Wiki, previous best, continued to increase to 34% of their P21-tagged articles having label ‘female’.

There has been a lot of movement in sex-ratios of different languages. As stated earlier this is also due in some part to the maturing of wikidata clusters. Next year we will be able to see if there is a deceleration in these ratios moving.

We now move on to invesitage the second confounding factor, the increase in accepted “sex or gender” values.

Non ‘male’ or ‘female’ values.

As we saw earlier, there are now 12 properties that are being used to describe P21. In May 2013 the only non-male-female term was intersex, and P21 said that you should be one of male, female, or intersex. I was quite angered by this prescriptiveness, but with help from online discussions, and with thanks to the gendergap mailing list some of those policies have changed. This now is the “sex or gender” property, rather than just “sex” which I consider a mixed result – quite literally. I am pleased that there is one instance of this value that is “sodium”, because I support this prorpety allowing any value. To be clear, what an “accepted” value means, is that periodically a check is run on the database, and non-accepted values are compiled into a list for user-attention. So robot won’t fight you if you use an unaccepted value, but the fodder is there for a human combatant.

Two more of the new values mention “animal” because in Czech and Finnish, there are seperate words to describe sexes of non-human animals. And of course Wikipedia has articles on famuos animals too.

Some others like ‘Female’ (a 1933 film), and ‘man’ are probably due to human tagging errors.

Below are the wiki’s which have the highest number of non ‘male’ or ‘female’ values as represented by the new column at the end “non_MF%”. As you can see none of them exceed two-tenths of 1%. So the upper analysis of year-on-year-change could at most be influenced by error range of ±0.2%.

In [130]:
top_non_MF.sort('non_MF%', ascending=False)
Out[130]:
wiki non_MF%
male animal yiwiki 0.181159
transgender female urwiki 0.092994
Female mgwiki 0.080321
genderqueer zh_min_nanwiki 0.066445
intersex ckbwiki 0.058754
transgender male arzwiki 0.042105
female animal hywiki 0.038812
kathoey thwiki 0.012922
man jawiki 0.001523
sodium eswiki 0.000990

10 rows × 2 columns

Intriguingly, Urdu which tops the charts in year-on-year ‘female’ increase also the leader in higest ratio of “transgender female” at nearly one-tenth of one percent, or about 1 in a 1000. This lends credence to the idea that some Urdu Wikipedians have been busy.

Accompanying Data Richness

Markus Krötzsch, instigator of the Wikidata Toolkit, on which this research rests, talks about the complexity of Wikidata data. Convincingly he discusses why full unstructured access to the data is important – creative queries. Both star and tree shaped queries should be possible and at the users discretion.

One less trivial query I wanted to cook up was – “on items with a P21 value what are the properties by per item, by P21 value?” Framed in English, are wikidata items about males data-richer?

In [44]:
sex_props_df[sex_props_df['item_count'] > 5]
Out[44]:
item_count total_props props_per_item
transgender female 41 398 9.707317
intersex 8 88 11.000000
female animal 6 44 7.333333
male animal 55 385 7.000000
genderqueer 8 63 7.875000
female 122288 738962 6.042801
male 768646 4816357 6.266028

7 rows × 3 columns

The above chart displays the properties per item, for all P21 values that occur 5 or more times. The results are telling, on avereage items with the ‘male’ property have 6.27 properties, and those with ‘female’ 6.04. It’s also worth mentioning that ‘transgender female’ averages 9.71 properties per item.

Conclusions

Without becoming too political, the biases that exist in Wikipedia’s representation of the world are systemic, and appreciable. We can see from our year-on-year calculations that there is movement in this dataset – albeit not always for the better. The representation of non-male-female items I suspect is lower than what a sample from the wolrd would indicate, but I don’t have any statistical reference, and would welcome suggestions on datasets with which to compare. Lastly we showed that not only in representation, but also in attention given to each item, underepresnted ‘sex or genders’ are less semantically-described.

Questions and criticism greatly recieved,

‽notconfusing

Start of Supporting Code

In [1]:
import json
from collections import defaultdict
import pandas as pd
import pywikibot
import decimal
NOPLACES = decimal.Decimal(10) ** 0
TWOPLACES = decimal.Decimal(10) ** -2
%pylab inline
VERBOSE:pywiki:Starting 1 threads...

Populating the interactive namespace from numpy and matplotlib

In [20]:
norm_sex[sexdf['total']>1000].sort(columns='non_MF', ascending=False).head(10)
Out[20]:
female animal intersex kathoey Female transgender female male animal male female transgender male genderqueer man sodium non_MF
zh_min_nanwiki 0.000000 0.000000 0.000000 0.000664 0.000000 0.000664 0.787375 0.210631 0.000000 0.000664 0 0 0.001993
yiwiki 0.000000 0.000000 0.000000 0.000000 0.000000 0.001812 0.897645 0.100543 0.000000 0.000000 0 0 0.001812
cywiki 0.000371 0.000000 0.000000 0.000186 0.000371 0.000000 0.820375 0.178326 0.000186 0.000186 0 0 0.001299
ckbwiki 0.000000 0.000588 0.000000 0.000000 0.000588 0.000000 0.893067 0.105758 0.000000 0.000000 0 0 0.001175
thwiki 0.000000 0.000000 0.000129 0.000129 0.000388 0.000388 0.788345 0.210492 0.000129 0.000000 0 0 0.001163
mswiki 0.000000 0.000000 0.000000 0.000223 0.000223 0.000446 0.802679 0.196205 0.000223 0.000000 0 0 0.001116
ruwikiquote 0.000000 0.000000 0.000000 0.000552 0.000000 0.000000 0.909492 0.089404 0.000000 0.000552 0 0 0.001104
mlwiki 0.000270 0.000270 0.000000 0.000270 0.000270 0.000000 0.796161 0.202758 0.000000 0.000000 0 0 0.001081
eowiki 0.000125 0.000187 0.000062 0.000062 0.000374 0.000249 0.851300 0.147640 0.000000 0.000000 0 0 0.001060
kowiki 0.000075 0.000075 0.000038 0.000038 0.000491 0.000113 0.801608 0.197373 0.000075 0.000113 0 0 0.001019

10 rows × 13 columns

In [3]:
jsonfile = open('lang_sex.json','r')
bigdict = json.load(jsonfile)
lang_sex = defaultdict(dict)
for keystring, count in bigdict.iteritems():
    lang, sex = keystring.split('--')
    lang_sex[lang][sex] = count
    
used_sexes = defaultdict(list)
for lang, sex_dict in lang_sex.iteritems():
    for sex in sex_dict.iterkeys():
        used_sexes[sex].append(lang)
        
used_sexes_count = {sex: len(lang_list) for sex, lang_list in used_sexes.iteritems()}
In [6]:
sexdf = pd.DataFrame.from_dict(lang_sex, orient='index')
sexdf = sexdf.fillna(value=0)
#sexdf.plot(kind='bar', stacked=True, figsize=(10,10))
Norm_sex is not "normal" sex, but rather the Sex-data normed into percentages.
norm_sex = sexdf.apply(lambda col: col / float(col.sum()), axis=1)
In [8]:
#Tranforming QIDs into English labels.
enwp = pywikibot.Site('en','wikipedia')
wikidata = enwp.data_repository()

def english_label(qid):
    page = pywikibot.ItemPage(wikidata, qid)
    data = page.get()
    return data['labels']['en']

sex_qs = [str(q) for q in norm_sex.columns]
sex_labels = [english_label(sex_q) for sex_q in sex_qs]

norm_sex.columns = sex_labels
VERBOSE:pywiki:Found 1 wikidata:wikidata processes running, including this one.

In [9]:
#norm_sex.index = [label.replace('wiki','') for label in norm_sex.index]
#comparing by total between two different dataframes requires 
#that norm_sex has not had any rows modified since it was created from sexdf
sexdf['total'] = sexdf.sum(axis=1)
fs1000 = norm_sex[sexdf['total']>10000].sort('female', ascending=True)
In [11]:
def show_by_lang_plot():
    fsplot = fs1000.plot(kind='bar', stacked=True, legend=True, figsize=(13,8), alpha=0.9, ylim=(0,1),
                         title= '''Comoposition of Wikidata Prorerty:P21 "Sex or Gender" by Language 
    (Languages with over 1,000 associated P21)''',
                         colormap='Set1')

    plt.yticks(linspace(0, 1, num=11), [str(decimal.Decimal(x * 100).quantize(NOPLACES))+'%' for x in arange(0,1.1,0.1)])
    
    ticklocs, langs = plt.xticks()
    langstrs = [str(decimal.Decimal(norm_sex.loc[lang.get_text()]['female']* 100).quantize(TWOPLACES))+'%  '+ lang.get_text() for lang in langs]
    plt.xticks(ticklocs, langstrs)
    plt.xlabel('Language-Wiki percentage "female"')

For your edification, the full data, and not just the ‘female’ percentages.

In [15]:
fs1000
Out[15]:
female animal intersex kathoey Female transgender female male animal male female transgender male genderqueer man sodium
slwiki 0.000074 0.000000 0.000000 0.000074 0.000074 0.000074 0.911398 0.088307 0.000000 0.000000 0.000000 0.00000
lawiki 0.000060 0.000060 0.000000 0.000060 0.000180 0.000120 0.889302 0.110219 0.000000 0.000000 0.000000 0.00000
bewiki 0.000000 0.000000 0.000000 0.000099 0.000000 0.000099 0.876528 0.123273 0.000000 0.000000 0.000000 0.00000
cawiki 0.000026 0.000000 0.000000 0.000026 0.000103 0.000129 0.870905 0.128786 0.000026 0.000000 0.000000 0.00000
elwiki 0.000000 0.000000 0.000000 0.000068 0.000068 0.000068 0.869061 0.130734 0.000000 0.000000 0.000000 0.00000
euwiki 0.000080 0.000000 0.000000 0.000080 0.000080 0.000239 0.865686 0.133837 0.000000 0.000000 0.000000 0.00000
skwiki 0.000078 0.000000 0.000000 0.000078 0.000078 0.000078 0.864363 0.135245 0.000078 0.000000 0.000000 0.00000
frwiki 0.000020 0.000025 0.000000 0.000005 0.000107 0.000097 0.858680 0.141045 0.000005 0.000015 0.000000 0.00000
cswiki 0.000048 0.000000 0.000000 0.000024 0.000096 0.000096 0.858648 0.141063 0.000024 0.000000 0.000000 0.00000
enwiki 0.000009 0.000012 0.000002 0.000002 0.000069 0.000052 0.857699 0.142132 0.000007 0.000014 0.000002 0.00000
ruwiki 0.000038 0.000019 0.000010 0.000010 0.000077 0.000077 0.857515 0.142226 0.000019 0.000010 0.000000 0.00000
dawiki 0.000036 0.000073 0.000000 0.000036 0.000109 0.000073 0.856768 0.142904 0.000000 0.000000 0.000000 0.00000
ukwiki 0.000062 0.000031 0.000000 0.000031 0.000062 0.000031 0.855573 0.144178 0.000031 0.000000 0.000000 0.00000
dewiki 0.000013 0.000006 0.000003 0.000003 0.000034 0.000047 0.855591 0.144277 0.000013 0.000013 0.000000 0.00000
itwiki 0.000006 0.000028 0.000000 0.000006 0.000090 0.000107 0.854986 0.144760 0.000011 0.000006 0.000000 0.00000
eowiki 0.000125 0.000187 0.000062 0.000062 0.000374 0.000249 0.851300 0.147640 0.000000 0.000000 0.000000 0.00000
glwiki 0.000082 0.000082 0.000000 0.000082 0.000329 0.000164 0.851602 0.147658 0.000000 0.000000 0.000000 0.00000
etwiki 0.000079 0.000079 0.000000 0.000079 0.000079 0.000237 0.846676 0.152771 0.000000 0.000000 0.000000 0.00000
arwiki 0.000040 0.000040 0.000000 0.000040 0.000079 0.000079 0.846166 0.153516 0.000000 0.000040 0.000000 0.00000
idwiki 0.000208 0.000052 0.000000 0.000052 0.000104 0.000156 0.843220 0.156156 0.000052 0.000000 0.000000 0.00000
hrwiki 0.000078 0.000078 0.000000 0.000078 0.000078 0.000156 0.842670 0.156863 0.000000 0.000000 0.000000 0.00000
eswiki 0.000030 0.000040 0.000000 0.000010 0.000109 0.000079 0.841094 0.158589 0.000020 0.000020 0.000000 0.00001
ptwiki 0.000043 0.000043 0.000000 0.000014 0.000199 0.000114 0.840760 0.158785 0.000014 0.000028 0.000000 0.00000
bgwiki 0.000091 0.000045 0.000000 0.000045 0.000091 0.000181 0.839242 0.160305 0.000000 0.000000 0.000000 0.00000
huwiki 0.000120 0.000040 0.000000 0.000040 0.000200 0.000200 0.839014 0.160386 0.000000 0.000000 0.000000 0.00000
plwiki 0.000031 0.000021 0.000000 0.000010 0.000063 0.000094 0.839206 0.160575 0.000000 0.000000 0.000000 0.00000
nlwiki 0.000041 0.000027 0.000000 0.000014 0.000082 0.000123 0.838302 0.161343 0.000014 0.000041 0.000014 0.00000
hewiki 0.000085 0.000042 0.000000 0.000042 0.000297 0.000127 0.836914 0.162449 0.000000 0.000042 0.000000 0.00000
trwiki 0.000076 0.000038 0.000000 0.000038 0.000114 0.000191 0.833810 0.165656 0.000038 0.000038 0.000000 0.00000
fiwiki 0.000060 0.000060 0.000020 0.000020 0.000099 0.000079 0.824523 0.175100 0.000040 0.000000 0.000000 0.00000
jawiki 0.000030 0.000030 0.000015 0.000015 0.000167 0.000107 0.823550 0.176039 0.000000 0.000030 0.000015 0.00000
zhwiki 0.000087 0.000029 0.000029 0.000029 0.000261 0.000145 0.820476 0.178885 0.000029 0.000029 0.000000 0.00000
nowiki 0.000020 0.000020 0.000000 0.000020 0.000082 0.000082 0.819593 0.180183 0.000000 0.000000 0.000000 0.00000
shwiki 0.000073 0.000073 0.000000 0.000073 0.000367 0.000000 0.817768 0.181571 0.000000 0.000073 0.000000 0.00000
fawiki 0.000066 0.000033 0.000033 0.000033 0.000332 0.000033 0.816748 0.182721 0.000000 0.000000 0.000000 0.00000
rowiki 0.000096 0.000048 0.000000 0.000048 0.000048 0.000096 0.816821 0.182844 0.000000 0.000000 0.000000 0.00000
simplewiki 0.000051 0.000051 0.000000 0.000051 0.000306 0.000102 0.815151 0.184084 0.000051 0.000153 0.000000 0.00000
viwiki 0.000093 0.000093 0.000000 0.000093 0.000093 0.000186 0.813103 0.186247 0.000000 0.000093 0.000000 0.00000
svwiki 0.000042 0.000042 0.000000 0.000014 0.000111 0.000056 0.811433 0.188275 0.000014 0.000014 0.000000 0.00000
commonswiki 0.000045 0.000000 0.000000 0.000045 0.000089 0.000268 0.810849 0.188614 0.000045 0.000045 0.000000 0.00000
kowiki 0.000075 0.000075 0.000038 0.000038 0.000491 0.000113 0.801608 0.197373 0.000075 0.000113 0.000000 0.00000
srwiki 0.000074 0.000074 0.000000 0.000074 0.000074 0.000223 0.799718 0.199688 0.000000 0.000074 0.000000 0.00000

42 rows × 12 columns

In [16]:
maydf = pd.read_table('may2013.csv',sep=',', index_col=0)
maydf['female'] = maydf['perc'] / 100.0
diffdf = maydf.join(other=norm_sex,how='inner',lsuffix='_may2013', rsuffix='_march2014')
diffdf['change%'] = (diffdf['female_march2014'] - diffdf['female_may2013']) / diffdf['female_may2013']
diffdf['change%'] = diffdf['change%'].apply(lambda x: decimal.Decimal(x * 100).quantize(TWOPLACES) )
In [19]:
non_MF_cols = [col for col in norm_sex.columns if col not in ['male','female']]
norm_sex['non_MF'] = norm_sex[non_MF_cols].sum(axis=1)
In [128]:
top_non_MF_dict = dict()
for s in non_MF_cols:
    t = norm_sex[sexdf['total']>1000].sort(columns=s, ascending=False)[s].head(1)
    top_non_MF_dict[s] = {'wiki':t.index[0],'non_MF%':t[0]*100}
top_non_MF = pd.DataFrame.from_dict(data=top_non_MF_dict, orient='index')
In [43]:
jsonfile = open('sex_propcount.json','r')
sex_props_json = json.load(jsonfile)
sex_props = defaultdict(dict)
for keystring, count in sex_props_json.iteritems():
    sex, prop = keystring.split('_')
    sex_props[sex][prop] = count
    
sex_props_df = pd.DataFrame.from_dict(sex_props, orient='index')

sex_qs = [str(q) for q in sex_props_df.index]
sex_labels = [english_label(sex_q) for sex_q in sex_qs]

sex_props_df.columns = ['item_count', 'total_props']

sex_props_df.index = sex_labels

sex_props_df['props_per_item'] = sex_props_df['total_props'] / sex_props_df['item_count']
VERBOSE:pywiki:Found 1 wikidata:wikidata processes running, including this one.

This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

2 thoughts on “Sex Ratios in Wikidata Part III”

Leave a Reply