WIGI, an Inspire Grantee

WIGI, the Wikipedia Gender Index, my project which looks at the gender representation in Wikipedia Biography articles, has won an Inspire Grant.

Over the last six months, along with fellow Wikipedians, we prototyped and extended this research into a paper, “Gender Gap Through Time and Space: A Journey Through Wikipedia Biographies and the ‘WIGI’ Index”. One aspect of the biography gender gap we were not able to observe, however, was the trend in female and nonbinary biographies over time. We were only ever looking at a single point in time, because it’s too computationally complex to compare all the histories of the Wikipedias together at once. Now, with $22,500 and a small team, our aim is to sample this data weekly, thereby gathering some longitudinal data on the way that Wikipedians are representing biographies.

Our project’s form is to create a data portal which will display visualisations of the state of gender in biographies. The underlying data, which associates biography gender with Wikipedia language, date of birth/death, citizenship, profession, and celebrity status, will be purposefully published under an open license. We hope that other researchers can make use of this social indicator, much in the same way one can use the United Nations’ Gender Inequality Index.

The project will be managed entirely on GitHub, and should be completed in about 6 months.

It promises to be,



Asking Ever Bigger Questions With Wikidata

This is a guest blog post I wrote for Wikimedia Deutschland, copied here:

German summary (translated): Maximilian Klein uses Wikidata as a data trove for statistical analyses of the world’s knowledge. In his article he describes how he searches Wikidata for answers to the big questions.

Asking Ever Bigger Questions with Wikidata

Guest post by Maximilian Klein

A New Era

Simultaneous discovery can sometimes be considered an indication of a paradigm shift in knowledge, and last month Magnus Manske and I seem to have both had a very similar idea at the same time. Our ideas were to look at gender statistics in Wikidata and to slice them up by date of birth, citizenship, and language. (Magnus’ blog post, and my own.) At first it seems like quite an elementary and naïve analysis, especially 14 years into Wikipedia, but only within the last year has this type of research become feasible. Like a baby taking its first steps, Wikidata and its tools ecosystem are maturing. That challenges us to creatively use the data in front of us.

Describing 5 stages of Wikidata, Markus Krötzsch foresaw this analysis in his presentation at Wikimania 2014. The stages, which range from Know to Understand, are: Read, Browse, Query, Display, and Analyse (see image). Most likely you have read Wikidata, and perhaps have even browsed with Reasonator, queried with autolist, or displayed with histropedia. I care to focus on Analyse, the most understand-y of the stages. In fact the example given for Analyse was my first exploration of gender and language, where I analysed the ratio of female biographies by Wikipedia language: English and German are around 15%, and Japanese, Chinese and Korean are each closer to 25%.

To do biography analysis before Wikidata was much harder. To know the gender of an article you’d resort to natural language processing, or hacks like counting gendered categories and guessing based on first name. Even more, the effort had to be duplicated for each language that had to be translated. Now the promise of language-free semantic data, and tools like Wikidata Query and Wikidata Toolkit, are here. The process is easier because it is more database-like: select, group by, apply, and combine.

With this new simplicity, let’s review what we have imagined so far. Here’s a non-exhaustive introduction to the state of creative question-asking so far:

Pushing Ourselves to Think Even Bigger

Can we think even bigger if we use more of the available data? Thinking about the fact that every claim may have an attached reference, Markus Krötzsch always wants to know: for a given set of claims, what references must be believed in order to believe the set of claims? With that notion we could look at all the claims associated with all the items of a given language, and thus the required belief system of that language. At this point we could ask: what are the differences in the belief systems of any two languages?

Another way we could test the fundamental principles of knowledge and culture is to consider the chains made by the subclass of, instance of, or cause of properties. Every language is present at different links of each chain. So we can look at the differences in ways in which languages organize a hierarchy of concepts – or if they think it’s a hierarchy at all.

Much fun for logicians and epistemologists. But we can also ask more socially important questions, questions about how language and society relate. What biases do we have that we aren’t even aware of? The method, for which I’ve proposed a PhD, could be conducted as follows. We’re aware of sexism in our societies, and as you’ve seen we’ve started to build a statistical profile of how it manifests in Wikidata. Likewise we’re cognizant of racism and homophobia. We might next look at the rates at which people appear in Wikidata by race and desire. Let’s assume we could train a model to say that these kinds of distributions are types of social biases. Next we could search every property in Wikidata to see if it indicated social bias. If successful, we may find overlooked stigmas and phobias in society.

I claim that our theoretical question-answering ability has paradigmatically shifted with the growing up of Wikidata. Soon enough you won’t even need to be a sophisticated programmer to whisper your questions into the system. So next time you’re reading, browsing, querying or displaying Wikidata, challenge yourself to think about how to analyse it too.

Which Index Is WIGI Most Closely Related To?

In my latest paper, “Gender Gap Through Time and Space: A Journey Through Wikipedia Biographies and the ‘WIGI’ Index” (blog post and on arxiv.org), my co-author Piotr Konieczny and I proposed a gender index. WIGI, the Wikipedia Gender Inequality Index, is composed of many indicators, but one in particular, the “nation-WIGI”, was designed to be comparable with other well-known indices. The nation-WIGI ranks each nation by the ratio of female biographies among the biographies of that nation’s citizens. Designed in this way, it can be correlated with other indexes. And potentially, we thought, given enough indexes and high enough correlations, we could get a sense for what WIGI is measuring in terms of other indices.

Due to word-count limits, we were unable to submit this research question with the rest of the paper, so it is included here. Formally, we formulated it thus:

RQ4: Of the other Gender Indices which divide also by nation which index is Wikipedia most closely related to?

First let’s recap the four other nation divided indices we are inspecting (see section 3 of our paper for more detail).

  • GDI
    • The UNDP’s Gender-related Development Index (GDI), introduced in 1995.
    • A gender-focused extension of the Human Development Index. GDI’s primary focus lies in gender gaps in life expectancy, education, and incomes.
  • GEI
    • The Gender Equity Index (GEI) introduced by Social Watch in 2005.
    • Developed to measure all situations that are unfavourable to women, it ranks countries on three dimensions: education, economic participation and empowerment.
  • GGGI
    • The Global Gender Gap Index (GGGI) developed by the World Economic Forum in 2006.
    • Intended to allow comparison of the gender gap across different countries and years, it focuses on four areas: economic participation and opportunity, educational attainment, political empowerment, and health and survival statistics.
  • SIGI
    • The Social Institutions and Gender Index (SIGI) of the OECD Development Centre from 2007.
    • A composite indicator of gender equality that solely focuses on social institutions (norms, values and attitudes), as well as on the four dimensions of family code, physical integrity, ownership rights and civil liberties.

    Comparison Data:

    With each of the above four foreign indices we have a ranking associating a nation (sometimes referred to as an economy) with an ordinal position. We would like to understand how close two indices are, for which we use the Spearman rank correlation coefficient. Two other technical points need addressing. First, we must use the intersection of nations covered by each index, to avoid missing-data problems. Second, we perform a calibration step to find the start decade of Wikidata data that maximises the correlation in question.
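
The comparison procedure described here can be sketched in pandas. This is a minimal illustration under stated assumptions, not the code from our repository: the function names, the `ratios_by_decade` structure, and the toy numbers are all invented for the example. Spearman's coefficient is computed as the Pearson correlation of the rank-transformed values.

```python
import pandas as pd


def spearman_on_shared_nations(index_a, index_b):
    """Spearman correlation of two nation-indexed Series, on shared nations only."""
    # Keep only nations present in both indices to avoid missing-data problems.
    shared = index_a.dropna().index.intersection(index_b.dropna().index)
    # Spearman's rho is the Pearson correlation of the rank-transformed values.
    ranks_a = index_a.loc[shared].rank()
    ranks_b = index_b.loc[shared].rank()
    return ranks_a.corr(ranks_b)


def calibrate_start_decade(ratios_by_decade, other_index):
    """Return (decade, rho) for the start decade that maximises the correlation."""
    scores = {
        decade: spearman_on_shared_nations(ratios, other_index)
        for decade, ratios in ratios_by_decade.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data, invented purely for illustration (hypothetical nations A to F).
other = pd.Series({"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6, "E": 0.5})
ratios_by_decade = {
    1800: pd.Series({"A": 0.10, "B": 0.30, "C": 0.20, "D": 0.25}),
    1900: pd.Series({"A": 0.30, "B": 0.25, "C": 0.20, "D": 0.10, "F": 0.40}),
}
best_decade, best_rho = calibrate_start_decade(ratios_by_decade, other)
```

In this sketch each entry of `ratios_by_decade` stands for the female ratio per nation when only biographies born from that decade onwards are counted; the calibration simply tries every candidate decade and keeps the best one.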

    The full source code of this calculation is available on GitHub. Also, as an aside, I have another blog post on a functional-programming solution to joining many DataFrames at once, which was useful in computing these results.

    Finally we produced a comparison table of indices, their correlation, the correlation significance, and the maximizing start decade. We present it ordered by correlation:

    Table: National-WIGI compared to Alternative Indexes (columns: Spearman Correlation, Calibrated Start Decade)

    Each alternative index shows a statistically significant, moderate correlation with our nation-WIGI index. This suggests that the female ratio of Wikidata humans associated with a country is, at minimum, a legitimate addition to the landscape of gender inequality indexes.

    Additionally, the fact that each alternative index most highly correlates when we consider only those biographies starting around 1900 is a positive sanity check for our data. Intuitively this makes sense in the light of the fact that traditional indexes talk about modern history only.

    Still, how should we interpret the fact that our nation-WIGI is most highly correlated with GEI, and least with GDI? What do GEI and GDI measure that shows what WIGI is measuring? We dig further into the methodologies of these indices.

    Social Watch describes its GEI as follows:

    “In Education, GEI looks at the gender gap in enrolment at all levels and in literacy; economic participation computes the gaps in income and employment and empowerment measures the gaps in highly qualified jobs, parliament and senior executive positions.”

    And the UN’s GDI reports itself as:

    “The new GDI measures gender gap in human development achievements in three basic dimensions of human development: health, measured by female and male life expectancy at birth; education, measured by female and male expected years of schooling for children and female and male mean years of schooling for adults ages 25 and older; and command over economic resources, measured by female and male estimated earned income.”

    So we find that both indexes use indicators connected to education and economic activity. The differing factor ultimately is that the GEI additionally measures empowerment by positions of power, whereas the GDI additionally measures life expectancy. This suggests that the ratio of female biographies by nation in Wikidata is more highly correlated with women’s positions of power by country than with life expectancy by country. That, at first glance, is commensurate with Wikipedia’s notability policies. Notability in Wikipedia essentially defers to inclusion or absence in the journalistic and scholarly record. That means that humans in positions of power, as GEI covers, would tend to be in Wikipedias in greater proportion. Thinking about GDI’s unique use of life expectancy, one does not obviously see a strong reason why those with greater life expectancy would be more covered in Wikipedia.

    Clearly this is a very rough investigation, and our conclusions can only be limited. Yet we still have some evidence for Wikipedia’s notability policy affecting the gender representation. That link might be clear with some feminist reasoning, but the data also supports the notion. Surely this is a nice fact to know for those who criticize the notability inclusion as it stands.

    For questions or suggestions, contact me on twitter – @notconfusing.


Joining many DataFrames at once in Pandas: “n-ary Join”

Joining many DataFrames at once with Reduce

In my last project I wanted to compare many different Gender Inequality Indexes at once, including the one I had just come up with, called “WIGI”. The problem was that the rank and score data for each index was in a separate DataFrame. I needed to perform repeated SQL-style joins. In this case I actually only had to join 5 DataFrames, for 5 indices. But later my partner came across the same problem in her research, and needed to join more than 100. In my mind I saw that we wanted to accomplish an n-ary join. Mathematically I wanted this type of operation, which I couldn’t find in pandas.

The answer I enjoyed implementing, perhaps because I saw it as this type of repeated operation, is the reduce of functional programming.

Ok, say we have these two data sets:

In [5]:
                    Rank     Score
Republic of China      1  0.356890
Kingdom of Denmark     2  0.347826
Sweden                 3  0.345212
South Korea            4  0.343662
Hong Kong              5  0.342857
In [6]:
         Rank   Score
Iceland     1  0.8594
Finland     2  0.8453
Norway      3  0.8374
Sweden      4  0.8165
Denmark     5  0.8025

We’d probably join them like this:

In [7]:
wigi.join(world_economic_forum, how='outer', lsuffix='_wigi', rsuffix='_wef')
                    Rank_wigi  Score_wigi  Rank_wef  Score_wef
Denmark                   NaN         NaN         5     0.8025
Finland                   NaN         NaN         2     0.8453
Hong Kong                   5    0.342857       NaN        NaN
Iceland                   NaN         NaN         1     0.8594
Kingdom of Denmark          2    0.347826       NaN        NaN
Norway                    NaN         NaN         3     0.8374
Republic of China           1    0.356890       NaN        NaN
South Korea                 4    0.343662       NaN        NaN
Sweden                      3    0.345212         4     0.8165

But we want to generalize. Notice here we also inject the name of the DataFrame into the column names to avoid “suffix-hell” as I would like to term it.

In [1]:
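import glob

import pandas

def make_df(filename):
    # The first CSV column (the nation name) becomes the index.
    df = pandas.read_csv(filename, index_col=0)
    # Tag every column with the index name, e.g. 'Rank' -> 'Rank_wigi'.
    name = filename.split('.')[0]
    df.columns = ['{}_{}'.format(col, name) for col in df.columns]
    return df

# One CSV per index; the original notebook used the IPython shell escape `!ls`.
filenames = sorted(glob.glob('*.csv'))

dfs = [make_df(filename) for filename in filenames]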

Now here’s the reducer. I actually end up wanting an inner join in the end, but the type of join is not important to illustrate the fact.

Here we join 5 DataFrames at once.

In [2]:
from functools import reduce  # a builtin in Python 2, lives in functools in Python 3

def join_dfs(ldf, rdf):
    return ldf.join(rdf, how='inner')

final_df = reduce(join_dfs, dfs) #that's the magic
             Score_gdi  Rank_gdi  Score_gei  Rank_gei  Rank_sigi  Score_sigi  Rank_wdf  Score_wdf  Rank_wef  Score_wef
Nicaragua        0.912       102         74        37         53      0.8405        13   0.272727         6     0.7894
Rwanda           0.950        80         77        19         43      0.8661       134   0.096154         7     0.7854
Philippines      0.989        17         76        26         57      0.8235         6   0.322785         9     0.7814
Belgium          0.977        38         79        12          1      0.9984        73   0.163734        10     0.7809
Latvia           1.033        52         77        19         24      0.9489        82   0.157623        15     0.7691

I really like the elegance of this solution. I admit there may be other ways to go about it with pandas only, and I understand the R mentality of “no for loops”. Still this is precisely why I like pandas in python – you still get the freedom to play as you wish if it makes more sense to you.
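
For comparison, here is a small sketch of one pandas-only alternative next to the reduce approach from above. The three toy frames are invented for illustration, and the sketch assumes every frame has unique row labels; under that assumption `pandas.concat` with `axis=1` and `join="inner"` keeps the same rows as the folded inner join, though possibly in a different order.

```python
from functools import reduce

import pandas as pd

# Toy frames, invented for illustration; columns already carry the index name.
gdi = pd.DataFrame({"Rank_gdi": [1, 2, 3]}, index=["Belgium", "Latvia", "Rwanda"])
gei = pd.DataFrame({"Rank_gei": [2, 1, 3]}, index=["Latvia", "Rwanda", "Norway"])
wef = pd.DataFrame({"Rank_wef": [3, 1, 2]}, index=["Rwanda", "Latvia", "Iceland"])
dfs = [gdi, gei, wef]

# The reduce approach: fold a binary inner join over the list of frames.
reduced = reduce(lambda ldf, rdf: ldf.join(rdf, how="inner"), dfs)

# A pandas-only alternative: concatenate column-wise, keeping shared row labels.
concatenated = pd.concat(dfs, axis=1, join="inner")

# Row order may differ between the two, so compare after sorting by label.
same = reduced.sort_index().to_dict() == concatenated.sort_index().to_dict()
```

The reduce version still has one advantage: the binary function can be swapped for any other pairwise operation, such as a merge on a column, without changing the fold.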

Cyberwizard Institute: Retrospective


Cyber Wizard Institute

The Cyberwizard Institute (CWI) was a free programming school based out of Sudo Room, running for the month of January 2015. The proclamation that I saw on their website before I volunteered to teach there was:

The idea is to be an anti-bootcamp. Anyone can participate. It’s free. We’re going to try hard to have lecture notes, assignments, and lecture livestreams up online. It will be primarily self-directed, but with guidance from higher level wizards.

As a founding member of sudoroom since 2011, but suffering from a recent malaise in my hacktivism, I found this the perfect project to reinvigorate my involvement. What most appealed to me was the idea of an anti-bootcamp, because I’ve wanted to make clear to the world the distinction I care about between start-up culture and technology. I wanted to do something metaphorically akin to hijacking the stereo system at a $4-coffee-wifi-shack and making a public service announcement that computers are not just fancy TVs, but programmable instruments of self-empowerment, which, in addition, can be used for non-commercial purposes.

Meeting Every Day

Without any formal advertising, each sudoer leading CWI was pleasantly surprised when 27 wizardlings showed up on the first day (14 women and 13 men by my count). When I remarked on this to CWI’s originator @marinakukso, she responded that “when you offer a free programming class, with no experience required – people want that”.

I recall some apprehension when we introduced ourselves, and there was the occasional naïve posturing of people who claimed themselves as programmers with the phrase “I know HTML”. But the need to impress quickly disappeared as we sat down and struggled with them to install Linux on the laptops they’d brought.

The next day I was nervous, anticipating that I would arrive at an empty room, since all we had shown fresh minds was that computer programming was about inexplicable Ubuntu hurdles. Still, with only slightly leaky attendance, most wizards did come back for more. And we went right on with teaching them bash.

We continued to meet for 5 hours daily with lectures and hackerspace-esque hands-on floating help from higher level wizards, which we dubbed “social code”. Our rhythm was found quickly, and only half way through the month CWI was feeling so magical, it received coverage in the East Bay Express:

“Many coding bootcamps in the Bay Area charge tens of thousands of dollars in fees, which can be seen as restricting access to what has become essential for finding a job in technology, let alone moving up in Silicon Valley’s so-called “meritocracy.” Kukso explained that Cyber Wizard Institute’s mission is very much aligned with that of Sudo Room, which is to give everyday folks the opportunity to understand and create the technology in their lives. “For a lot people who consider themselves nontechnical,” Kukso said, “a lot things relating to technology or coding seem mystical or secret, our perspective is … everyone can learn these types of things.””

Pedagogical Questions

Yet towards the end, I started to question the effectiveness and importance of CWI. From the beginning, as facilitators we quipped that “anti-bootcamp” really meant “bootcamp”. And the calendar began by reflecting that.

  • Day 1: Install Linux
  • Day 2: Unix and Bash
  • Day 3: vim
  • Day 4: HTML
  • Day 5: javascript
  • Day 6: Networking
  • Day 7: Node.js
  • Day 8: Git
  • etc…

Which is exactly the way that substack, Oakland’s pre-eminent “unix philosopher,” would have it. Yet, that was before the collaborative aspects took over and I began to try and think about how I would teach a less trained non-programmer version of myself what I know now. I mixed in:

(click to view the recorded lectures)

Where substack was spreading his knowledge of artisanal web-buildery, I was attempting to proselytize a world of mathematical elegance. At times I was worried this felt interfering and competitive to the wizards.

However the final projects did come to life, instigated solely by the intrinsic motivation of the new wizards. On the last day, Arduino hacks and personal-itch websites really had materialized. When I spoke to those who made it all the way through the month, they described a brighter perspective than my own: perhaps we inadvertently succeeded at being an anti-bootcamp.

The Medium Was Always The Message

As another facilitator, @Johnnyscript, said at the closing Cyberpunk Masquerade Wizard Initiation Ceremony, we showed them what coding is actually like: many differently opinionated hackers running around without too much top-down organization. We delivered the essence of the hackerspace more accessibly than just happening upon a room of silent geeks staring down. Our package, despite being a bit dishevelled, did form a solid curriculum, although it was not as refined as something that you might pay $17,000 for. Yet it also was not an altar for Silicon Valley start-up-ism.

Taken together, we find a point that I am surprised that I missed. Whereas programming bootcamps are normally Cathedrals, as Eric Raymond might put it, we built a Bazaar.

Notconfusingly yours,

Your humble newb-druid.

Cyberwizard Institute II

“Will there be another Cyberwizard Institute?” many are asking. Likely, but it is as yet unplanned because volunteer work is tiring. If you have the initiative or want to hear about an initiative, join our discussion tracker on GitHub.


Preliminary Results From WIGI, The Wikipedia Gender Inequality Index

This is a preliminary list of results from a research project that is being compiled into a full paper on the subject.

The full paper, in its academic form, is now available on arxiv.


WIGI is the Wikipedia Gender Inequality Index, a project whose purpose is to attempt to gain insight into the gender gap through understanding which humans are represented in Wikipedia. Professor Piotr Konieczny and I thought that, whereas some gender gap research focuses on the editors of Wikipedia directly, we would view the content and metadata of articles as a proxy measure for those editing. Although the notion of analysing Wikipedia content seems quite old, I believe the advent of Wikidata allows a new range of ambitious questions to be asked.


We use Wikidata, the new semantic database that feeds Wikipedia. By inspecting its weekly data dumps, we are able to inspect all the semantic properties associated with every Wikipedia page in any language, all at once. In this case we focus on any article that is about a person, and any data recorded for the properties gender, date of birth, date of death, place of birth, citizenship, and ethnic group (example). We do this courtesy of an excellent tool known as the Wikidata Toolkit.

We compare the found data to historical census data and the World Economic Forum’s Gender Gap Index.

For other computations we also supplement the original data with aggregation maps, built using Mechanical Turk, that derive cultures from place of birth, citizenship, and ethnic group.

This project has been conducted in an Open Notebook Science way, where we have been posting our results and receiving feedback as we work. You can chat with us on-wiki, or on-github where all the code and data needed to reproduce this research is available.

Let’s begin:

 Summary Statistics

As of October 14 2014 we inspected a total of 2,561,999 or about 2.5 million “human” items, that is any Wikidata item with the property “instance of: Q5 (human)”.

On each of those items we looked for the following additional properties, and found them on the following number of items.

Property                % of total  Items with property
ethnic group                  0.30                7,772
country*                     23.47              601,361
place of birth               23.93              613,092
date of death                28.79              737,522
citizenship                  41.44            1,061,634
culture**                    45.20            1,158,086
date of birth                57.92            1,484,003
gender                       89.40            2,290,433
at least one site link       99.05            2,537,545
a “Q” ID                    100.00            2,561,999


*country is determined by seeing if the place of birth is a country, or if it is a city, see if the city has a country property

**culture is determined by translating ethnic group, place of birth, and citizenship into 1 of 9 world cultures, as per the Inglehart-Welzel map of the world, with Mechanical Turk. Then we take the consensus of the three aggregated variables. (Actually there were no disagreements between the three variables.) All aggregation maps are available for inspection on GitHub.

Now the first derived and naive statistic of interest – the total gender breakdown. As we’ve seen above 10.3% is of unknown gender, otherwise we encounter 75.7% male, 13.9% female, and <.01% nonbinary which is perhaps better described as 152 cases.

Sanity Checking With Historical Data

We want some sanity checking that the data from Wikidata reflects the world at large. To do this we compared our total population per year, calculated by date of birth, versus the world population.

Comparing the Wikidata data to historical census data, we find a high, significant correlation in total population: Pearson correlation coefficient = .983 with p<0.01. This lends some credence to the notion that this dataset reflects the world at large. (By the way, the historical data trends backwards to 10,000 BCE, but the earliest date of birth in Wikidata is about 4,000 BCE.)

Total Biographies Over Time

These graphs show the absolute volume of items by date of birth and death, by gender, both over all time and from 1800 onwards.

This first visualization of the gender gap shows how Wikipedia’s retroactive focus on history has been consistent in its bias in representing females. It’s also generally quite a smooth curve, save for some noticeable spikes around World War I and II.

It’s intriguing to contemplate how we might expect date of birth and death to be related. If they were equally well recorded – and barring extreme events like wars – the death curve would look like a right-shifted birth curve. However we see empirically that this is not true. At all times the death curve remains absolutely smaller than the birth curve, by a factor of about two-thirds. So we can see a bias towards recording the date of birth more often than the date of death.

Gender Ratios Over Time

The indication of visual skew in gender prodded me to look at how the ratios of male, female, and nonbinary genders develop over time.

Note: From here I aggregate the nonbinary genders into a single class, not for philosophical reasons, but for ease of visualising the many dimensions they represent. I consider it important to be descriptive about what is found in the data, and not to lose any perspective because of personal assumptions about gender. If you think there are better ways to describe this data, I would be glad to hear from you.

We adjust our viewing window here to start at 1400 CE, because the data before then is too sparse to provide meaningful visual data.


Curiously, from about 1800 to the present, the female ratio of biographies is greater when using the date of birth measure than the date of death measure. What is the interpretation? Somehow recording female date of birth is more prominent in a way that recording date of death isn’t. Although both ratios are rising, somehow date of birth is outstripping date of death. It would be great to investigate how much this is owed to recording practices and how much it is owed to social phenomena.

Notice that after about 1990 the spike is very large, and even crosses 50%. This is more statistical anomaly than anything else, since the number of humans with date of birth about 1990 is very small, as you can see in the volumes plot. There are only 12,000 entries with date of birth in 1990 and only 199 biographies born in the year 2000. Even discounting the very recent trends of the last 20 years, which describe humans that are just entering adulthood or younger, the female ratio is rising exponentially. I was expecting to fit a logistic curve to the female percentage so that we could predict when we might reach parity, however that notion does not make sense with what is being shown. Although it may not necessarily indicate equity, by fitting an exponential model to this percentage we can calculate when the female percentage would reach 50%. By our calculations, the exponential extrapolation would reach 50% female representation in February 2034. But of course predicting the growth of percentages can lead to nonsensical results (as humorously shown in this xkcd comic). I suspect we will see a logistic model, but simply haven’t encountered the inflection point of slowing rate of growth yet.
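
To make the extrapolation step concrete, here is a minimal sketch of fitting an exponential and solving for the 50% crossing. It is not our actual calculation: the function name is hypothetical and the data are synthetic, generated from a known exponential rather than taken from Wikidata. The fit is a straight line through the logarithm of the share, which is one common way to fit share = a · exp(b · year).

```python
import numpy as np


def year_reaching(target, years, shares):
    """Fit share = a * exp(b * year) via least squares on log(share),
    then return the year at which the fitted curve reaches `target`."""
    b, log_a = np.polyfit(years, np.log(shares), 1)
    return (np.log(target) - log_a) / b


# Synthetic shares generated from a known exponential, NOT the WIGI data.
years = np.arange(1900, 1990, 10)
shares = 0.05 * np.exp(0.015 * (years - 1900))
parity_year = year_reaching(0.5, years, shares)
```

An exponential has no ceiling, so the same function will happily report a year for a target of 150%, which is exactly the kind of nonsensical result to watch for when extrapolating percentages.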

Aggregating Cultures

First some caveats as to method which we use in the next section.

  • There is no good way to aggregate cultures perfectly. Aggregation in general assumes some loss of fidelity. The point in doing so is to gain a broader-stroke picture, and in this case simplify visualizations.
  • The method we used for aggregation – starting from the Inglehart-Welzel map of the world (right), and then “mechanical turking” in the rest of the values – comes loaded with its own cultural baggage and perspective.
    By DancingPhilosopher [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
  • Inglehart-Welzel map only really makes sense for modern geopolitical boundaries. For instance the notion of having a Protestant and Catholic world before Protestantism and Catholicism, does not make sense. We use those soft modern boundaries superimposed over the geographical region to determine historical values. So if you were born in ancient Greece, you are known as Orthodox in this method.
  • Some ethnicities were a mixture of two cultures, like “Thai-American”; in those cases we took the modifier, so we’d use “Thai” -> South Asian. There are two ways to do this and neither is very good, so we make a compromise to get a rough picture. The full data is available for more munging if you would like to fine-tune it.
  • We aggregate the 9 cultures from 3 similar but different Wikidata properties: citizenship, ethnic group, and place of birth. Since each of those is a different concept, a conflict may arise; however, in this research we did not find a case where different property aggregations gave different world cultures.
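
The consensus step described in these caveats can be sketched as follows. The map excerpts, the function name, and the tie-handling rule are all assumptions made for illustration; the published aggregation maps are much larger, and in our data no conflicts actually arose.

```python
from collections import Counter

# Hypothetical excerpts of the three aggregation maps (not the published ones).
CULTURE_MAPS = {
    "citizenship": {"Japan": "Confucian", "United States": "English-speaking"},
    "ethnic group": {"Thai": "South Asian", "Japanese people": "Confucian"},
    "place of birth": {"Tokyo": "Confucian", "Athens": "Orthodox"},
}


def culture_of(item):
    """Map each available property to a world culture and take the consensus.

    Returns None when no property can be mapped or when the top vote is tied.
    """
    votes = Counter(
        CULTURE_MAPS[prop][value]
        for prop, value in item.items()
        if value in CULTURE_MAPS.get(prop, {})
    )
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # conflicting properties with no majority
    return ranked[0][0]
```

For example, `culture_of({"citizenship": "Japan", "place of birth": "Tokyo"})` gives "Confucian", while an item whose values appear in none of the maps gives `None`.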

Gender Ratios By Culture

We make a cross-tabulation of gender by culture. A Chi-squared test shows the observed distributions of gender by culture to differ significantly. We now graph the female percentage of biographies by culture.
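
As a rough illustration of the cross-tabulation and test, the sketch below builds a contingency table with `pandas.crosstab` and computes Pearson's chi-squared statistic by hand. The records are made up and the two culture labels "X" and "Y" are placeholders; the real table has nine cultures and more gender categories.

```python
import numpy as np
import pandas as pd


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    observed = np.asarray(table, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if gender were independent of culture.
    expected = row_totals * col_totals / observed.sum()
    statistic = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return statistic, dof


# Made-up records, for illustration only.
people = pd.DataFrame({
    "culture": ["X"] * 30 + ["Y"] * 30,
    "gender": ["female"] * 10 + ["male"] * 20 + ["female"] * 20 + ["male"] * 10,
})
table = pd.crosstab(people["culture"], people["gender"])
statistic, dof = chi_squared(table)
```

The statistic is then compared against a chi-squared distribution with `dof` degrees of freedom to get a p-value; if SciPy is installed, `scipy.stats.chi2_contingency` does the whole calculation in one call.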


More than anything, I think what astounds most is the large difference in the absolute number of biographies by culture. The European and English-speaking world dominates by a large amount here. Although, it might be that European and English-speaking biographies are simply more likely to be described in Wikidata at the moment, by some sort of quirk of the volunteer import process. Later we’ll see how that affects German and Austrian items.

If we do inspect the female percentages as-is, we find a very high showing for the Confucian culture. After talking to some Confucian-world Wikipedians on twitter (who I can’t find now to credit) and fellow Wikipedia researcher Hai-Yi Zhu from the University of Minnesota, we produced the hypothesis that this is because the phenomenon of celebrity is larger in those cultures, and celebrity is more evenly gender-distributed. We will investigate the celebrity hypothesis in a bit. If you have another hypothesis, we welcome your input for testing.

We provide the graph of nonbinary percentages of biographies by culture too. The cultures are ordered in the same way as the female graph for ease of comparison. Notice that the ordering is relatively similar to the female graph – so on the surface, recording female biographies is linked to recording nonbinary genders too.


 Gender Ratios Over Time

Let’s mix all these variables now, by viewing the culture ratio trends over time. To note our sample size as we continue, only 951,101 or about 35% of total records have all of date of birth, culture, and gender data.



You can see that the recent past, around 1800, is a low point for female recognition in all cultures, relative to most of the past 3 millennia. Likewise, visually it is evident that historical trends in different cultures have, while not reaching 50%, peaked at much higher percentages. In the modern historical graph, we can see a rise occurring for all cultures, and even super-linear growth for the Confucian and South Asian countries. The sky-rocketing ratios after 1990 are less significant, as noted above.

Gender by Wikipedia Language

Now let us recall that there is one more dimension we have recorded, the sitelink dimension, which indicates whether or not for an item a Wikipedia language has an entry for it. To be clear, say for instance that Finnish Wikipedia has an article about a Japanese human; we would be commenting on Finnish Wikipedia. With this data we can analyse the female and nonbinary tendencies of a Language, not a nationality or culture.

Here we have plots that show the relative frequencies of female articles per Wikipedia language, versus the size of the language.


And again for nonbinary humans.


Notice that in general there is no simple trend linking Wikipedia size to female representation. The visual technique I use here is to look for the points whose magnitude from the origin is greatest. Mostly I see a relatively flat, constant rate, with a few Wikipedias standing out a bit, like the Japanese, Chinese and Tagalog. So again we are seeing some evidence for Confucian and South Asian cultures being less gender biased when following the sitelink method of analysis.

Gender by Aggregated Wikipedia Language

To shore up the idea of cultural influence in the sitelinks analysis, we aggregate the languages into the nine World Cultures as before. In this case, since there are only about 280 languages, I assigned all of the languages by hand, rather than resorting to Mechanical Turk.


To clarify, the technique used here is that every Wikidata item counts towards a culture if a sitelink exists in at least one language associated with that culture. So if an article has language links to English, Chinese, and Japanese wikipedia, that item counts only once towards each of the English-speaking and Confucian categories.
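The counting rule above can be sketched as follows. The mapping shown is a hypothetical three-entry stand-in for the hand-assigned table of roughly 280 languages, and the function names are made up for this example.

```python
from collections import Counter

# Hypothetical excerpt of the hand-made language-to-culture assignment.
LANGUAGE_TO_CULTURE = {
    "enwiki": "English-speaking",
    "zhwiki": "Confucian",
    "jawiki": "Confucian",
}


def cultures_for_item(sitelinks, mapping=LANGUAGE_TO_CULTURE):
    """Return the set of cultures with at least one sitelink for this item."""
    return {mapping[wiki] for wiki in sitelinks if wiki in mapping}


def count_by_culture(items_sitelinks):
    """Count each item at most once per culture."""
    counts = Counter()
    for sitelinks in items_sitelinks:
        # Using a set means two Confucian-language links still add only 1.
        counts.update(cultures_for_item(sitelinks))
    return counts


example_counts = count_by_culture([
    ["enwiki", "zhwiki", "jawiki"],  # once English-speaking, once Confucian
    ["jawiki"],                      # once Confucian
])
```

Building a set per item is what prevents an item with both Chinese and Japanese articles from being double-counted in the Confucian category.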

Now we have a more coherent picture about which types of Wikipedias by language are focusing on female articles. And we do continue to see a high Confucian showing.

Let us test our celebrity hypothesis. For the Chinese, Japanese, Korean, Tagalog, Urdu, German and English Wikipedias, we retrieved the page content of each Biography from 1930 until 1989 (recall that there are very few Biographies with date of birth 1990 and higher).

We search for the English or foreign language words that are associated with celebrity. The dictionary used is:

{'jawiki': [u'俳優', u'選手', u'歌手', u'ミュージシャン', u'モデル', u'アイドル'],

'zhwiki': [u'演員', u'運動員', u'歌手', u'音乐家', u'模特兒', u'偶像'],

'kowiki' : [u'배우', u'선수', u'가수', u'음악가', u'모델', u'우상'],

'tlwiki': [u'artista', 'aktor', u'player', u'mang-aawit', u'musikero', u'modelo', u'idolo'],

'urwiki': [u'اردو', u'کھلاڑ', u'گلوکار' , u'موسیقار' , u'ماڈل', u'بت'],

'dewiki': [u'schauspieler' , u'spieler', u'Musiker', u'Sänger', u'Modell', u'Idol'],

'enwiki' :[u'actor', u'actress', u'player', u'singer', u'musician', u'model', u'idol']}

If you can provide better translations than Google’s software, let me know. We consider a celebrity to be a biography that contains one of the above words within the first 200 characters of its Wikipedia entry.
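A minimal sketch of that rule, assuming plain substring matching on the first 200 characters. The function name is hypothetical, and lower-casing both sides is an assumption of this sketch; the original notebook may have matched case-sensitively.

```python
# English keyword list copied from the dictionary above.
ENWIKI_KEYWORDS = ["actor", "actress", "player", "singer", "musician", "model", "idol"]


def is_celebrity(page_text, keywords, window=200):
    """True if any keyword occurs within the first `window` characters."""
    lead = page_text[:window].lower()  # assumption: case-insensitive match
    return any(word.lower() in lead for word in keywords)


# Invented example texts, not real article content.
early_mention = "Jane Doe (born 1950) is an American actress and singer."
late_mention = "x" * 200 + " She later became an actor."
```

Because this is substring matching, a word such as "remodel" would also trigger the "model" keyword, so the measure is best read as a coarse proxy.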

Then we make a heatmap comparing the language, the decade, the gender, and the celebrity percentage.


Using visual inspection, at first glance we can see that the female matrix is darker in general than the other two matrices. So recorded females are more likely to be celebrities among these languages.

Likewise you can see that in general the heatmap transitions to being darker at the top than bottom, so we have shifted to being more celebrity conscious in most languages in recent years.

Lastly we see some vertical-striped features showing that for instance Tagalog is prone to being celebrity conscious across gender and time.

To determine the significance of the effects we perform a logistic regression analysis in predicting the celebrity percentage variable. The coefficient matrix is printed below.

              coef   std err        z    P>|z|   [95.0% Conf. Int.]
decade      0.0236     0.013    1.823    0.068    -0.002     0.049
enwiki      0.0509     0.875    0.058    0.954    -1.664     1.766
jawiki      0.7763     0.837    0.927    0.354    -0.865     2.418
kowiki      1.3834     0.832    1.662    0.097    -0.248     3.015
tlwiki      3.0009     0.945    3.176    0.001     1.149     4.853
urwiki      0.8901     0.869    1.025    0.306    -0.813     2.593
zhwiki      0.5383     0.846    0.637    0.524    -1.119     2.196
female      1.3580     0.453    2.999    0.003     0.471     2.245
intercept -47.9056    25.368   -1.888    0.059   -97.626     1.815


Depending on which arbitrary significance threshold you choose to use, we find different answers, but at least the female and Tagalog variables are significant with p<0.05. If we loosen the significance threshold slightly, decade and Korean also become predictors. This lends a lot of credence to the notion that in the cases in which women are recorded in Wikipedias, they have a strong tendency to be celebrities.
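To make the columns of the coefficient matrix easier to read, here is a small sketch that recomputes the z statistic, two-sided p-value and 95% interval from a coefficient and its standard error, using the female row as input. It assumes the usual normal approximation and that the coefficients are log-odds, as in a standard logistic regression; the function name is hypothetical.

```python
import math


def summarise_coefficient(coef, std_err, z_crit=1.959964):
    """Recompute z, two-sided p, 95% interval and odds ratio for one row."""
    z = coef / std_err
    # Two-sided p-value under a standard normal: P(|Z| > |z|).
    p = math.erfc(abs(z) / math.sqrt(2))
    lower = coef - z_crit * std_err
    upper = coef + z_crit * std_err
    return {
        "z": z,
        "p": p,
        "ci": (lower, upper),
        # exp(coef) is the multiplicative change in odds, if coef is a log-odds.
        "odds_ratio": math.exp(coef),
    }


# Values taken from the "female" row of the table above.
female = summarise_coefficient(1.3580, 0.453)
```

Small differences from the printed table are expected, since the inputs here are already rounded.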

Connections To The World Economic Forum Index

Indexes are useful, but they are more useful as a group of compatible and comparable indexes. We compared our place of birth and citizenship data as it related to gender, to the World Economic Forum Gender Gap Index. The World Economic Forum uses its own methodology to produce a scalar value on the interval (0,1) to rank the gender equality of a country. To match to that format, we take the Wikidata data in the form of female composition of biographies by country.

We performed a calibration step to see which time window of data would make our ranking of countries correlate most closely with that of the World Economic Forum. If the Wikidata dataset is used with the time window only considering the biographies with date of birth between 1890 and 1990, the Spearman rank correlation is 0.31 with a p value of 0.03. That means there is some foundation for accepting the female composition of Wikidata items of humans associated with a country as an inequality index, because it is significantly correlated with another respected inequality index.

Here is a sample of the two rankings side-by-side. We display the top 10 as per the World Economic Forum rank, and then the top 10 as per the Wikipedia rank. You’ll also see the associated WIGI rank, the raw scores for each, and the difference in the ranking.
World Economic Forum Top 10

Country                        WEF Rank   Wikipedia Rank   WEF Score   Wikipedia Score   Rank Difference
Iceland                               1               30      0.8594            0.1895               -29
Finland                               2               39      0.8453            0.1807               -37
Norway                                3               22      0.8374            0.2142               -19
Sweden                                4                1      0.8165            0.3452                 3
Denmark                               5               20      0.8025            0.2149               -15
Nicaragua                             6                9      0.7894            0.2727                -3
Rwanda                                7              108      0.7854            0.0962              -101
Ireland                               8               64      0.7850            0.1586               -56
Philippines                           9                3      0.7814            0.3228                 6
Belgium                              10               58      0.7809            0.1637               -48


WIGI Top 10

Country                        WEF Rank   Wikipedia Rank   WEF Score   Wikipedia Score   Rank Difference
Sweden                                4                1      0.8165            0.3452                 3
South Korea                         117                2      0.6403            0.3437               115
Philippines                           9                3      0.7814            0.3228                 6
Bahrain                             124                4      0.6261            0.3171               120
Mauritius                           106                5      0.6541            0.2941               101
People’s Republic of China           87                6      0.6830            0.2812                81
Australia                            24                7      0.7409            0.2760                17
Japan                               104                8      0.6584            0.2732                96
Nicaragua                             6                9      0.7894            0.2727                -3
Swaziland                            92               10      0.6772            0.2593                82

We see that the rankings bear some similarity, but that the correlation is mild. Still, we can take away that what the WEF is driving at with its measure and the number of female biographies that exist about humans in a country are somewhat related ideas.
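For readers who want to try the rank comparison themselves, here is a minimal pure-Python sketch of a Spearman rank correlation, applied only to the ten WEF top-10 rows shown above. It is not the code used for the reported figure, which covers all matched countries in the calibrated time window, so the value it produces will differ from 0.31. The helper names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values ascending from 1, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Scores for Iceland ... Belgium, copied from the WEF top-10 table above.
wef_scores = [0.8594, 0.8453, 0.8374, 0.8165, 0.8025,
              0.7894, 0.7854, 0.7850, 0.7814, 0.7809]
wiki_scores = [0.1895, 0.1807, 0.2142, 0.3452, 0.2149,
               0.2727, 0.0962, 0.1586, 0.3228, 0.1637]
rho_top10 = spearman(wef_scores, wiki_scores)
```

The calibration step would repeat this calculation for each candidate date-of-birth window and keep the window with the strongest correlation.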

Data Reliability

The question of how accurately Wikidata reflects all Wikipedias is important to determine before addressing the question of how well Wikipedias reflect the world at large.

During our research, we found a curious quirk in the way that nationality is recorded, and the story is instructive in showing that Wikidata still has a few artefacts of its bot-imported nature. A more in-depth analysis is available in a post I previously blogged, “Wikidata and the Measure of Nationality”.

In short, the idea centres around an early finding that indicated that Protestant European humans seemed to disappear in the 1930s when we were determining culture just using the “Place of Birth” property. It looked like this:



This is what led us to investigate how nationalities were being classified on Wikidata. The next graphs show which humans have which classification method – by place of birth, citizenship, or both – for nationality. For Germanic humans we saw a large shift:


And for all other populations we witness no such thing:


After we published these findings, a Wikimedian wrote in to explain that the import of Germanic human data into Wikidata occurred through a bot called “FischBot”, and that the shift is likely only related to the way that software operated. The moral is that we should stay vigilant about the data quality in Wikidata.


It is not my intention to draw any large scale conclusions at the moment. For that I will wait until the publication of the paper for which this analysis is intended. Still I would be glad to hear any insights you might see until then.


We finished writing the paper. An excerpt from the conclusion there:

Our research confirms that gender inequality is a phenomenon with a long history, but whose patterns can be analyzed and quantified on a larger scale than previously thought possible. Through the use of Inglehart-Welzel cultural clusters, we show that gender inequality can be analyzed with regards to world’s cultures. In the dimension studied (coverage of females and other genders in reference works) we show a steadily improving trend, through one with aspects that deserve careful follow up analysis (such as the surprisingly high ranking of the Confucian and South Asian clusters).


Tweet at me @notconfusing .


I programmed all of this research using the IPython notebook, and it’s all entirely open source and hopefully reproducible from https://github.com/notconfusing/WIGI.

I plan to start parsing and filtering Wikidata monthly to provide updated data, which should be coming soon.


The best part of soup is that soup doesn’t have parts.

This is a piece I wrote for Bulbes, a zine about soup.
The concept of “smooth” versus “striated” spaces (striated means lined or striped) by Deleuze and Guattari is a stream of philosophy with a missing interpretation through soup.
Depending on your choice of violent conflict, the smooth/striated dichotomy has popularly been depicted in the film “Die Hard” and the 2002 Israeli Operation in Nablus. In those scenarios parties moved not through the striated spaces of elevators, stairs, streets and squares, but through smooth and direct routes like air ducts and holes cut in walls.
I prefer a more pastoral example to grasp the notion of seeing a smooth space under a striated one. Northern Californian valleys are, by default, quite literally a smooth space with rolling hills. A striated framework is imposed on top of them in the form of designated walking paths. This New Year’s Day I was hiking there with some friends when our crew realized that the right side of the bifurcated path we took was a dead-end about 10 metres in. Some walked backwards along the right path to the junction point to continue down the left path, but my girlfriend Adi walked directly across the imaginary hypotenuse, thereby stepping off any of the paths. That is seeing the smooth under the striated.

Soups are the ultimate smooth in the foodspace. You can eat any part of the soup, at any time, by gracefully carving through the three-dimensional liquid with your spoon. The same cannot be said for a pie, sweet or savoury, as it has an inside and an outside, and a top and a bottom layer. Because of this you must eat its parts in an order, and that order may come with a value judgement. Likewise a burrito in foil is not smooth; it is a striated space of one stripe – you are supposed to eat it front to back in linear order. Even a combination of potatoes and peas on a plate must cross the threshold of your lips in some discrete relative positional arrangement. That is not true for the soup group: a hummus, a miso, or a lentil stew. You can eat those in any order – or rather there’s no order to speak of. There being no order frees us from there being judgement of how we ate, and eschews a level of social etiquette.

Perhaps one of the great social orders and hierarchies that exist, at least in the Western world today, is in the relationship space. The way I have been taught to view relationships in my life, largely donated by the patriarchy, is striated. Here is a map of the mental conception of the pathways of relationships for me. I wonder how true it is for you?

Striated Relationship Space

It does not look like a soup. It is much more like a high road or a main drag, with a few side alleys you might find yourself unlucky enough to fall into. Relationships, says the striated-theory, are places to go with junctions and turns to take.

But that’s just one way to view it. In fact, in a conversation I was having with another partner last year, we decided to start having less sex. Quite easily we could slip into viewing this as us having been “Friends with Benefits” or “non-exclusive partners” and now moving into the “Friend zone”. Let’s resist the temptation to impose striation. Let’s think about it as soup. The relationship space is greatly simplified when soupified in smooth-theory.

Smooth Relationship Space
Updated: A second way to think about a soupier model without an explicit time axis, although a curve is drawn inside the space to demonstrate how a relationship might evolve over time.


(Notes: the image I have drawn here is not accurate for the relationship I described in the last paragraph. Additionally, if anyone wants to take a stab at redrawing the smooth relationship space without a time axis I would be glad to post alternate imaginings).

In the relationships-as-soup-iverse, that conversation can be viewed as a modification of the composition of soup. The soup still exists, it might taste different, it might be more or less substantive, but it is still soup. Transferring the notion that soup has no order and carries no judgement, likewise the relationship is never eaten, or slurped wrongly. You cannot break etiquette or convention by having traversed the relationship roads poorly.

When I found this viewpoint I was relieved because I was having fears about whether I’d broken her expectations, or upset the social order between our mutual friends. When I switched from viewing relationships as a striation of discrete statuses, to a smooth soup of two people, I was liberated in a small way. Liberated because the framework had no room for our mutually-agreed feelings to be against-the-grain.

Perhaps relationships, like soups, don’t have distinguishable parts that can be assembled in the wrong order. Let your soup act as a tiny reminder to see the smooth under the striated.


I received this facebook comment, which I found instructive on how striated relationship space can frustrate those that aren’t willing to submit to a complete striation.


  • Dee Coetzee The first diagram I’ve heard called the relationship escalator. It’s problematic not only because of its required linear ordering, but because you are required to continue making your way up the escalator, or else get off altogether. There is no room for contented plateaus where nobody has any interest in anything other than what they already have. The escalator also contributes to the frustrating lives of asexual people – a lot of people aren’t willing to accept someone in a romantic relationship (much less a marriage) who doesn’t have sex, because sex is lower down on the escalator.

    I think the soup model actually works pretty well because the soup can contain any mixture of sexual, romantic, familial, or friendship elements, and you can stir in new ingredients over time, or just keep enjoying what you’re eating now, as you prefer. (It is kind of hard to draw though – it feels like a high-dimensional space more than anything.)


Omni Radio 1 – What is the Omni Commons?

In this inaugural Omni Radio podcast, Kwe and I introduce some of the collectives of the Omni, its purpose and its position in Oakland’s current landscape. To find out more visit https://omnicommons.org/ .

The interviewees in this podcast are Cere from Counter Culture Labs, Jesse from Food Not Bombs, Margareta and Andrew from Backspace Wellness, and Jenny from Sudo Room.

Personal Statement: In Full

The “Personal Statement” for the graduate school application is the attempt to explain how you will make a difference, not just in your research, but in making the University as an organism more equitable.

Luckily, my proposed research is precisely about the kinds of social division that the instructions to this document ask you to address:

Please describe how your personal background and experiences inform your decision to pursue a graduate degree.
In this section, you may also include any relevant information on how you have overcome barriers to access higher education, evidence of how you have come to understand the barriers faced by others, evidence of your academic service to advance equitable access to higher education for women, racial minorities, and individuals from other groups that have been historically underrepresented in higher education, evidence of your research focusing on underserved populations or related issues of inequality, or evidence of your leadership among such groups.

Each University wants a different length of personal statement (UC Berkeley gives no specification, and University of Minnesota asks for a page). So I’ve drafted my most complete thoughts, and am posting them here before taking the editorial chainsaw to them. Plus I’m including the figure that I made from my transcripts which helped me think about disadvantages.

Personal Statement for Max Klein

This last month I attended my first ‘ecstatic dance’ event. I started to play a game with myself where I identified each unspoken assumption I was making about dancing, and tried to break it. I had to dance in one location, to not repeat a move too often, to stand, to take it seriously, to be energetic, to not have pauses, to synchronise it to the music. It’s a fantastic game to play, because the size of the list that you can make is astounding – and it’s never exhaustive. Like dancing constraints, the list of prejudices that affects who enters and succeeds in higher education is long and never fully enumerated. And as in dance, higher education is shaped by its biases.

My discovery of the reality of social biases was a turning point in my life. The first time I awoke to my internalised racism a mix of discomfort and amazement overcame me. It was during a protest on the famous steps of Sproul Hall. After a few rabble-rousing speeches a black woman came to the stage and started delivering activist poetry. The poem had themes of being a strong, earthy, celestial woman and was sung with jazz-swing. I grew bored and criticised the performance. And then I received a moment of self-awareness wherein I saw that I was dismissing the content because of the delivery. I remember feeling compelled to physically run away from the event, but it was unsuccessful because I was actually trying to run from myself. The entire night I sat in a tearful bewilderedness. I had been brought up to think of myself as not racist, and yet incontrovertibly I had just been racist. It was a painful moment that marked the beginning of when I realised I could have a lot of prejudices I wasn’t aware of.

After the illusion of a just world was shattered, my subsequent prejudices came to light more rapidly. My own misogyny became very real upon my reading of favourite academic Joseph Reagle’s Free as in Sexist deconstruction of sexism in Open Culture. (Of course it took a man to show me that.) I lost my religious dogma at the holocaust memorials in Berlin and Auschwitz, when I saw that accepting unquestioned messages was dangerous – including the ones from my Jewish family. Data showed me my homophobia when I made a spreadsheet of romantic partners grouped by whether or not I met them travelling, and then tried to explain why I was less straight when touring. Only last month the perusal of a blog of a woman I met at a wedding introduced me to the ‘fat stigma’ I had been harbouring. The continuous waves of realisation perpetuate my wonder at just how many unidentified stigmas I’m still holding.

The feeling I get from a solid attack on my belief-system is so powerful that chasing after it has become the driving force in my life. No wonder it’s precisely what my proposed research agenda is about. The proposal is to make an algorithmic social-bias-detector to run on crowdsourced databases. To think of it in concrete terms, let’s consider some observational research I hacked together and posted on my blog in 2013. It meant that I was the first person to see the proportion of female biographies across all Wikipedia languages. They ranged from 8.83% for Slovenian Wikipedia to 19.97% for Serbian Wikipedia; compare English Wikipedia at 14.21%. There was a sinking feeling to see these paltry numbers, but also one of optimism to have performed a quantification that can aid the correction of them. Despite how complex that moment was, it definitely persuaded me to double down on this line of research. So I’ve proposed to mine all the properties in Wikipedia (or Freebase etc.), like profession, or properties that aren’t even about people, like the age of a concept, and see which fit into our model of what a bias looks like.

Through this detector I am trying to arm underserved academics with data. The aim is to provide a series of inequality indexes like the United Nations Gender Inequality Index. In fact I have already built a prototype that compares gender, dates of birth, places of birth and ethnicity. This makes the inequality data scene richer because it comes from the online Open Data landscape rather than opaque surveys. This will be valuable for academics currently researching inequality and make it much easier for other researchers to include inequality as a dimension in their projects. Moreover, the astoundingly unequal results of the internet’s own inequality index will generate lots of mainstream awareness of digital social biases, which will lend credence to the access issue.

The second half of my continuing research, the search for unidentified biases, will address the “other individuals” of the “women, racial minorities, and individuals from other groups that have been historically underrepresented in higher education”. First let’s smash a binary. To be underrepresented in higher education is not a binary relationship. A person has many facets, and each facet has a different representation in higher education. Sometimes it seems like being underrepresented is reserved only for women and racial minorities; but I think that a big lesson from, and a nod to, the work done by those coming from those backgrounds is to acknowledge a spectrum of underrepresentedness. Therefore as my project attempts to put its finger on many more social biases I hope everybody will come to see a part of themselves that is underrepresented. That will be very powerful, I claim, because once all people see themselves on that spectrum of underrepresentedness, it is easier to see the validity of the challenges of groups like women and racial minorities, since they are not in a different class of challenges but rather the same type of challenge at a different intensity. Essentially, viewing yourself as a minority to any degree highlights the importance of good allyship society-wide.

When I insert myself into the spectrum of underrepresentedness the main challenge that comes to the surface is the lack of formal education in my immigrant family. Going over my transcript to submit in this application, the arching narrative it told made obvious the difficulties of having no parents that had been to college before me.

At first, after high school, I didn’t go to college at all; nobody suggested it or pressured me to, and I was simply left to my own devices. It took some years until the social stigma (here’s another one) of having bad grades and low education frustrated me, and so I began junior college with a man-on-fire attitude. Bolstered by the successful waypoint of transferring, the educational passion continued. Yet towards the end of my college career, I felt I wasn’t working towards a goal anymore as I had been, and my motivation slipped into a depression. This was the challenge because my family could not advise me on why to sprint to the end of that degree. I overcame that challenge by redirecting my energy into real-world endeavours, which is when my Wikipedia volunteerism first blossomed – and serves as the basis for my research today. Even more, I’ve learned from that mistake by preplanning for the next depressive phase in graduate study when I inevitably encounter it, so instead of deflection I can greet the uncharted feelings with more reaching out, both to advisors and my past self.

I recall at that time of difficulty for me how my friends could phone their degree-holding family members and receive a boost over the shaky potholes, so that they never amassed into a derelict street and their vehicle never totally fell apart. It’s really a class difference, when you imagine the family as an organism – like how an idea is an organism in memetics. It’s easier for the family to repeat behaviour rather than create new pathways. Having this handhold of minority-ness in education I believe has allowed me to see a galaxy of discrimination, and become a leader in changing it.

It has actually been in educational access that I’ve been a leader in change. After avenging the stigmatic education perceptions with my college degree, I had all desire to keep on learning, and none to impress anyone. In the winter after college I founded Sudo Room “hackerspace and creative community” along with 23 others. There we taught and learned mathematics and technology completely free of institutions for no money, as a non-profit, and for no ulterior motive. It was precisely because it was free as in speech and beer that we had community members, including children and their school teachers, many of them people of colour, turning up for our “Today We Learned” sessions.

It was at Sudo Room I encountered radical activists that turned me on to start practicing my allyship seriously. I lapped up all the geek-feminism wiki anyone could link me to. I donated to, and took as much training as Ada Initiative would allow a cis male to participate in. Now I refuse to be on all-male panels, and admonish sexist remarks online and in person with best practices. I know there’s no special reward for doing the minimum things to not be sexist – that was a further humbling realization. The uncomfortable epiphany cemented my dedication to this sort of DIY social uplift, and meant my being treasurer, the one person backstopping financial responsibility for a not-yet 501c3 nonprofit. I’m so serious about making an equitable society I risked my own bankruptcy.

For all its merits, after two years pursuing career hackerdom I feel like I’ve hit a wall which I want graduate study to address. I believe that in principle in today’s connected world one can fulfil their career without an institution – that would be a path where those coming from less educated backgrounds could flourish. But that is the exception; the theory breaks down with factors like closed access papers, and the need to be propelled by serious, high quality collaborators. So as much as I might disagree with the concept in the abstract – because of its bias against women and racial minorities, etc. – I do want and need something from the institution. This reluctant acceptance is what has caused me to focus my interest in computational social science research directly on unequal representation on the web and by proxy in institutions. There is no need for me to translate my work and history into how it will aid representation in institutions because my very work and history is about unequal representation in institutions.