Notconfusing rules for conversation: 2 rules and a jumpstart.

Meeting people can be a slog. “Hello, what’s your name?”, “Where are you from?”, “What do you do?”, “How do you yawn?”. Yawn? Sorry, I was nodding off just writing about how repetitive and tiresome modern meeting and greeting can be. Owing to the way that social networks store information about us, we’re used to thinking about people as a list of attributes, a “forms” structure. Trans-inclusive feminism has already laid out how select-a-value gender is problematic for self-determination, and it has even subtler consequences in meeting people. We’ve come to assume the next person we meet is some combinatoric permutation of drop-down menus. How are we supposed to meet that person who is our lifelong friend, but at the moment just looks like one more INTJ or Virgo?

In fact the disillusionment from these gruelling social interactions is exactly the motivation for having friends, as a commiserating shelter. How do we let humans do the human thing and wow us with their outstanding creative expression of self from the moment we first meet? I submit notconfusing’s two rules for conversation.

  1. Ask questions that reflect choices people have or could make.
  2. Ask questions that have never been asked before.

Asking questions that reflect choices or decisions is a way to understand a person’s values and principles, which is more informative than part of their current happenstance. Even though this point is supposed to cause a deeper understanding, the questions need not be heavy. “When you’re sleeping on your favourite side, are you facing towards your alarm clock?” might tell you a bit about how much someone wants to combat their own habits without asking “how cognisant are you of your habits and how do you want to combat them?” The analysis of their choices can be done together out loud or both parties can be trusted to do so internally. In either case the point is to revel in the complexity of your partner, while gifting them a bit of Rogerian psychology.

Notice that just Rule 1 by itself could still allow for a “What are your hobbies?” variant, so Rule 2 is brought in to stem the tedium. At first it might seem impossible to ask an entirely unique question to every person, but – as I will prove – there really are an infinite number of these types of questions. Here are a few strategies.

The first strategy is analogous to an infinite game I learned called “Uses for…” where you try to come up with as many uses as you can for a specific item. The example I recall reading about is a bed sheet. So let’s play: It can be used as a tablecloth, as an escape rope for climbing out of windows, as a substitute for an all-white painting, as a shooting target for short-sighted people, as a stencil for a papier-mâché bed, etc. Try to come up with 5 more.

Now apply creative riffing to the things you notice about your partner. For instance, these are the topics I brought up in the last ice-breaking conversations I’ve had: reminiscing over video rental returns (standing near a letterbox), a comparison of how different tapes will tear when you don’t have scissors (electrical taped wallet), how often I think about life from a bird’s-eye view (standing at different levels), and the history of the vulcanization of rubber (rode with a flat tyre). Going off-script and generating questions based on the partner and surroundings guarantees freshness. The way your associate engages gives you some understanding of their gestalt person-ness.

Even if you feel like you filled out pointless forms all day at work and are sapped of your free-associativity, there is always the abstraction “meta” trick. Assume that you have racked your brain, and “where are you from?” is the absolute best question you can come up with because you are only meeting people out of some hateful obligation. You can apply question-abstraction to ask them “what does a person’s answer to <absolute best question I can muster> mean about a person’s personality?” Yes, use your own staleness as a weapon. Since the result of the question-abstraction is also a question, it can be applied to itself again and again to yield infinitely many unique questions. QED. (If you think this is a sad proof, then I encourage you to really try it. I imagine you’ll become loopy enough from the hypnotic repetition that your co-discusser will either join you in your recursion – great fun – or they will have walked away, which is just as well.)

A last technique, if you want to borrow a bit, is to use my growing list of ice-breakers. I created them as group introductions when I was facilitating Sudo Room hackerspace meetings. As they are targeted at a tech-y crowd you may need to customize a bit – exactly the point that I’m trying to champion.

With the application of these 2 rules you begin to transgress social mores for great good. You ought to explode small talk to eschew complacency. Then you can make more and better friends. Ironically, though, making this kind of conversation may have the effect of pinning you as a weirdo. Yet disobey the laziness of phone alienation, as Saul Williams does in Talk to Strangers: “… that square box don’t represent the sphere that we live in. The earth is not a flat screen, I aint trying to fit in.”

List of Yoga Quotations

Here is a list of Yoga quotations that I’ve compiled from my 200-hour yoga teacher training, other classes I’ve attended, and various yoga books.

Jon Isaacs

  • “The pose begins once you want to leave it”.
  • “Who went to their first yoga class because their life was going really well‽”
  • “I had a hedgefund guy fire me once because I was talking about greed in class.”

Sean Feit

  • On taking non-harming literally, “I take an antibiotic – genocide.”

Jean Mazzei

  • “You can have peace or mind, but not peace of mind because the mind’s purpose is to think.”

Cora Wen

  • “You probably think you have a knee, there’s nothing there. There’s no knee.”
  • “The knee is the prisoner of the hip and the ankle.”

Stacey Swan

  • On being a good teacher, “It’s not about putting your foot behind your head, but keeping it out of your mouth.”
  • “The american way is ‘no pain, no gain’, but yoga is ‘no pain, no pain'”
  • “A good yoga class should be like a Seinfeld episode,” (in that it should come full circle at the end).

Karen Macklin

  • “Vinyasa can also mean how you sequence your life.”

Adrianna Webster

  • “On an inhale, breathe out”.

Leslie Kaminoff – Yoga Anatomy

  • On the spine, “The full glory of nature’s ingenuity is apparent in the human spine…From an engineering perspective it is clear that we have the smallest base of support, the highest center of gravity, and the heaviest cranium (proportional to our body weight) of any other mammal.  As the only true bipeds on the planet, we are also earth’s least mechanically stable creatures.”
  • On breathing, “The energy expended in breathing produces a shape change that lowers the pressure in the chest cavity and permits the air to be pushed into the body by the weight of the planet’s atmosphere. In other words, you create the space and the universe fills it.”
  • Paraphrase on hand balances, “4/5th of the foot is dedicated to weight-bearing  and 1/5th is dedicated to dexterity. The hand (on the other hand) is 1/5th weight-bearing, 4/5th dexterous.”

Rudolf von Laban

  • “Each bodily movement is embedded in a chain of infinite happenings from which we distinguish only the immediate steps and, occasionally, those which immediately follow… In every trace form created by the body, both infinity and eternity are hidden.”

Joel Kramer – Yoga as Self-Transformation

  • “The essence of yoga is not attainment, but how awarely you work with your limits.”
  • “If you’re running from the feeling, it’s pain.” (Otherwise it’s just intensity.)
  • “Yesterday’s Level of Flexibility”. The (unhelpful) concept which I call YLF.

Desikachar –  The Heart of Yoga

  • Yoga defined, “attempting to do something you haven’t before.”


On headstand, “It’s like Wu-Tang says, you gotta ‘protect ya neck.'”

On stepping onto your mat, “Let’s go for a magic carpet ride.”

Travis Judd

  • “Make a conscious choice about what kind of practitioner you want to be right now.”


I can’t recall the provenance of these quotes sadly. Let me know if you can.

  • “The idea that we are ever not moving is an illusion.”
  • “asana is a process not a product otherwise we could say ‘not in a pose’ if head isn’t touching knee, but that is false.”
  • ‘Yoga’ has the root ‘Yuj’ which is the root for the English word ‘Yoke.’
  • Like humans, “water is transparent and reflective but don’t see those properties when in motion.”
  • “If you feel like you’re being inauthentic start telling the truth.”

















WIGI, an Inspire Grantee

WIGI, the Wikipedia Gender Index, my project which looks at the gender representation in Wikipedia Biography articles, has won an Inspire Grant.

Over the last six months, along with fellow Wikipedians, we prototyped and extended this research into a paper, “Gender Gap Through Time and Space: A Journey Through Wikipedia Biographies and the ‘WIGI’ Index”. One aspect of the biography gender gap we were not able to observe, however, was the trend over time in female and nonbinary biographies. We were only ever looking at a single point in time, because it’s too computationally complex to compare all the histories of the Wikipedias together at once. Now, with $22,500 and a small team, our aim is to sample this data weekly, thereby gathering some longitudinal data on the way that Wikipedians are representing biographies.

Our project’s form is to create a data portal which will display visualisations of the state of gender in biographies. The underlying data, which associates biography gender with Wikipedia language, date of birth/death, citizenship, profession, and celebrity status, will be purposefully published under an open license. We hope that other researchers can make use of this social indicator, much in the same way one can use the United Nations’ Gender Inequality Index.

The project will be managed entirely on github, and should be completed in about 6 months.

It promises to be,



Asking Ever Bigger Questions With Wikidata

This is a guest blog post I wrote for Wikimedia Deutschland, copied here:

German summary (translated): Maximilian Klein uses Wikidata as a trove of data for statistical analyses of the world’s knowledge. In his article he describes how he searches Wikidata for answers to the big questions.

Guest post by Maximilian Klein

A New Era

Simultaneous discovery can sometimes be considered an indication of a paradigm shift in knowledge, and last month Magnus Manske and I seem to have both had a very similar idea at the same time. Our ideas were to look at gender statistics in Wikidata and to slice them up by date of birth, citizenship, and language. (Magnus’ blog post, and my own.) At first it seems like quite an elementary and naïve analysis, especially 14 years into Wikipedia, but only within the last year has this type of research become feasible. Like a baby taking its first steps, Wikidata and its tools ecosystem are maturing. That challenges us to creatively use the data in front of us.

Describing 5 stages of Wikidata, Markus Krötzsch foresaw this analysis in his presentation at Wikimania 2014. The stages, which range from Know to Understand, are: Read, Browse, Query, Display, and Analyse (see image). Most likely you have read Wikidata, and perhaps even browsed with Reasonator, queried with autolist, or displayed with histropedia. I care to focus on analyse – the most understand-y of the stages. In fact the example given for analyse was my first exploration of gender and language, where I analysed the ratio of female biographies by Wikipedia language: English and German are around 15%, and Japanese, Chinese and Korean are each closer to 25%.

To do biography analysis before Wikidata was much harder. To know the gender of an article’s subject you’d resort to natural language processing, or hacks like counting gendered categories and guessing based on first name. What’s more, the effort had to be duplicated for each language you wanted to cover. Now the promise of language-free semantic data, and tools like Wikidata Query and Wikidata Toolkit, are here. The process is easier because it is more database-like: select, group by, apply, and combine.
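To make the “select, group by, apply, and combine” steps concrete, here is a minimal sketch in plain Python. The records are hand-made stand-ins for Wikidata items and the function name is hypothetical; it is not the code behind the figures quoted in this post.

```python
from collections import defaultdict

# Hypothetical records standing in for Wikidata items:
# (item label, gender, languages that have an article about the item).
records = [
    ("item_a", "female", ["en", "de"]),
    ("item_b", "male", ["en"]),
    ("item_c", "male", ["en", "ja"]),
    ("item_d", "female", ["ja"]),
    ("item_e", None, ["de"]),  # gender not recorded
]


def female_ratio_by_language(records):
    # select: keep only items with a recorded gender
    selected = [r for r in records if r[1] is not None]
    # group by: collect the genders seen under each language
    groups = defaultdict(list)
    for _label, gender, languages in selected:
        for lang in languages:
            groups[lang].append(gender)
    # apply + combine: one ratio per group, gathered into a single dict
    return {lang: genders.count("female") / len(genders)
            for lang, genders in groups.items()}


ratios = female_ratio_by_language(records)
```

With semantic data the gender is simply read off each item, so the same four steps work unchanged for every language.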

With this new simplicity, let’s review what we have imagined so far. Here’s a non-exhaustive introduction to the state of creative question-asking:

Pushing Ourselves to Think Even Bigger

Can we think even bigger if we use more of the available data? Thinking about the fact that every claim may have an attached reference, Markus Krötzsch always wants to know: for a given set of claims, what references must be believed in order to believe the set of claims? With that notion we could look at all the claims associated with all the items of a given language, and thus the required belief system of that language. At this point we could ask: what are the differences in the belief systems of any two languages?

Another way we could test the fundamental principles of knowledge and culture is to consider the chains made by the subclass of, instance of, or cause of properties. Every language is present at different links of each chain. So we can look at the differences in ways in which languages organize a hierarchy of concepts – or if they think it’s a hierarchy at all.

Much fun for logicians and epistemologists. But we can also ask more socially important questions, questions about how language and society relate. What biases do we have that we aren’t even aware of? The method, for which I’ve proposed a PhD, could be conducted as follows. We’re aware of sexism in our societies, and as you’ve seen we’ve started to build a statistical profile of how it manifests in Wikidata. Likewise we’re cognizant of racism and homophobia. We might next look at the rates at which people appear in Wikidata by race and desire. Let’s assume we could train a model to say that these kinds of distributions are types of social biases. Next we could search every property in Wikidata to see if it indicated social bias. If successful, we may find overlooked stigmas and phobias in society.

I claim that our theoretical question-answering ability has paradigmatically shifted with the growing up of Wikidata. Soon enough you won’t even need to be a sophisticated programmer to whisper your questions into the system. So next time you’re reading, browsing, querying or displaying Wikidata, challenge yourself to think about how to analyse it too.

Which Index Is WIGI Most Closely Related To?

In my latest paper, “Gender Gap Through Time and Space: A Journey Through Wikipedia Biographies and the ‘WIGI’ Index” (blog post), my co-author Piotr Konieczny and I proposed a gender index. WIGI, the Wikipedia Gender Inequality Index, is composed of many indicators, but one in particular, the “nation-WIGI”, was designed to be comparable with other well-known indices. The nation-WIGI ranks each nation by the share of biography articles about its citizens that are about women. Designed in this way, it can be correlated with other indexes. And potentially, we thought, given enough indexes and high enough correlations, we could get a sense of what WIGI is measuring in terms of other indices.

Due to word-count limits, we were unable to submit this research question with the rest of the paper, so it is included here. Formally, we formulated it thus:

RQ4: Of the other gender indices that also divide by nation, which index is Wikipedia most closely related to?

First let’s recap the four other nation-divided indices we are inspecting (see section 3 of our paper for more detail).

  • GDI
    • The UNDP’s Gender-related Development Index (GDI), introduced in 1995.
    • A gender-focused extension of the Human Development Index. GDI’s primary focus lies in gender gaps in life expectancy, education, and incomes.
  • GEI
    • The Gender Equity Index (GEI) introduced by Social Watch in 2005.
    • Developed to measure all situations that are unfavourable to women, it ranks countries on three dimensions: education, economic participation and empowerment.
  • GGGI
    • The Global Gender Gap Index (GGGI) developed by the World Economic Forum in 2006.
    • Intended to allow comparison of the gender gap across different countries and years, it focuses on four areas: economic participation and opportunity, educational attainment, political empowerment, and health and survival statistics.
  • SIGI
    • The Social Institutions and Gender Index (SIGI) of the OECD Development Centre from 2007.
    • A composite indicator of gender equality that solely focuses on social institutions (norms, values and attitudes), as well as on the four dimensions of family code, physical integrity, ownership rights and civil liberties.

    Comparison Data:

    With each of the above four foreign indices we have a ranking associating a nation (sometimes referred to as an economy) and an ordinal position. We would like to understand how close two indices are, for which we use the Spearman rank correlation coefficient. Two other technical points to be addressed are that we must use the intersection of  nations covered by each index to avoid missing data problems. And lastly, we compute a calibration step to find the start decade of Wikidata-data that maximises the correlation in question.
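The comparison just described can be sketched as follows. This is a minimal illustration in plain Python, not the code from the github repository: the function names, the `{start_decade: {nation: score}}` layout and the toy scores are all assumptions made for the example, and it computes only the coefficient, not its significance.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if a list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def spearman(index_a, index_b):
    """Spearman's rho, using only the nations present in both indices."""
    shared = sorted(set(index_a) & set(index_b))
    a = ranks([index_a[nation] for nation in shared])
    b = ranks([index_b[nation] for nation in shared])
    return pearson(a, b)  # Spearman is Pearson computed on the ranks


def calibrate(wigi_by_decade, other_index):
    """Return the (start_decade, rho) pair with the highest correlation."""
    return max(((decade, spearman(scores, other_index))
                for decade, scores in wigi_by_decade.items()),
               key=lambda pair: pair[1])


# Toy numbers, invented purely to exercise the functions.
wigi_by_decade = {
    1800: {"A": 0.10, "B": 0.20, "C": 0.15, "D": 0.05},
    1900: {"A": 0.10, "B": 0.20, "C": 0.30, "D": 0.90},
}
other_index = {"A": 1.0, "B": 2.0, "C": 3.0, "E": 4.0}
best_decade, best_rho = calibrate(wigi_by_decade, other_index)
```

Nations D and E drop out because each appears in only one of the two indices, which is the intersection step mentioned above.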

    The full source code of this calculation is available on github. Also, as an aside, I have another blog post on a functional-programming solution to joining many dataframes at once, which was useful in computing these results.

    Finally we produced a comparison table of indices,  their correlation, the correlation significance, and the maximizing start decade.  We present it ordered by correlation:

    [Table: National-WIGI compared to Alternative Indexes. Columns: Spearman Correlation, Calibrated Start Decade.]

    Each alternative index shows some statistically significant, moderate correlation with our nation-WIGI index. This suggests that the female ratio of Wikidata humans associated with a country is, at minimum, a legitimate addition to the landscape of gender inequality indexes.

    Additionally, the fact that each alternative index most highly correlates when we consider only those biographies starting around 1900 is a positive sanity check for our data. Intuitively this makes sense in the light of the fact that traditional indexes talk about modern history only.

    Still, how should we interpret the fact that our nation-WIGI is most highly correlated with GEI, and least with GDI? What do GEI and GDI measure that shows what WIGI is measuring? We dig further into the methodologies of these indices.

    Social Watch’s GEI explains itself that:

    “In Education, GEI looks at the gender gap in enrolment at all levels and in literacy; economic participation computes the gaps in income and employment and empowerment measures the gaps in highly qualified jobs, parliament and senior executive positions.”

    And the UN’s GDI reports itself as:

    “The new GDI measures gender gap in human development achievements in three basic dimensions of human development: health, measured by female and male life expectancy at birth; education, measured by female and male expected years of schooling for children and female and male mean years of schooling for adults ages 25 and older; and command over economic resources, measured by female and male estimated earned income.”

    So we find that both indexes use indicators connected to education and economic activity. The differing factor ultimately is that the GEI additionally measures empowerment by positions of power, whereas the GDI additionally measures life expectancy. This suggests that the ratio of female biographies by nation in Wikidata is more highly correlated with women’s positions of power by country than with life expectancy by country. That, at first glance, is commensurate with Wikipedia’s notability policies. Notability in Wikipedia essentially defers to inclusion or absence in the journalistic and scholarly record. That means that humans in positions of power, as GEI covers, would tend to be in Wikipedias in greater proportion. Thinking about GDI’s unique use of life expectancy, one does not see an obvious reason that those with greater life expectancy would be more covered in Wikipedia.

    Clearly this is a very rough investigation, and our conclusions can only be limited. Yet we still have some evidence for Wikipedia’s notability policy affecting the gender representation. That link might be clear with some feminist reasoning, but the data also supports the notion. Surely this is a nice fact to know for those who criticize the notability inclusion as it stands.

    For questions or suggestions, contact me on twitter – @notconfusing.


Joining many DataFrames at once in Pandas: “n-ary Join”

Joining many DataFrames at once with Reduce

In my last project I wanted to compare many different Gender Inequality Indexes at once, including the one I had just come up with, called “WIGI”. The problem was that the rank and score data for each index was in a separate DataFrame, so I needed to perform repeated SQL-style joins. In this case I actually only had to join 5 DataFrames, for 5 indices. But later, while I was helping my partner with her research, she came across the same problem and needed to join more than 100. In my mind I saw that we wanted to accomplish an n-ary join. Mathematically I wanted this type of operation, which I couldn’t find in pandas.

The answer I enjoyed implementing, perhaps because I saw it as this type of repeated operation, is the reduce of functional programming.

Ok, say we have these two data sets:

In [5]:
                    Rank     Score
Republic of China      1  0.356890
Kingdom of Denmark     2  0.347826
Sweden                 3  0.345212
South Korea            4  0.343662
Hong Kong              5  0.342857
In [6]:
         Rank   Score
Iceland     1  0.8594
Finland     2  0.8453
Norway      3  0.8374
Sweden      4  0.8165
Denmark     5  0.8025

We’d probably join them like this:

In [7]:
wigi.join(world_economic_forum, how='outer', lsuffix='_wigi', rsuffix='_wef')
                    Rank_wigi  Score_wigi  Rank_wef  Score_wef
Denmark                   NaN         NaN         5     0.8025
Finland                   NaN         NaN         2     0.8453
Hong Kong                   5    0.342857       NaN        NaN
Iceland                   NaN         NaN         1     0.8594
Kingdom of Denmark          2    0.347826       NaN        NaN
Norway                    NaN         NaN         3     0.8374
Republic of China           1    0.356890       NaN        NaN
South Korea                 4    0.343662       NaN        NaN
Sweden                      3    0.345212         4     0.8165

But we want to generalize. Notice here we also inject the name of the DataFrame into the column names to avoid “suffix-hell” as I would like to term it.

In [1]:
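import os

import pandas

def make_df(filename):
    # read_csv replaces the since-removed DataFrame.from_csv;
    # the first column becomes the index, as it did there
    df = pandas.read_csv(filename, index_col=0)
    # e.g. 'gdi.csv' -> 'gdi'; tags every column with its source file
    name = filename.split('.')[0]
    df.columns = ['{}_{}'.format(col, name) for col in df.columns]
    return df

# plain-Python stand-in for the IPython-only `filenames = !ls`
filenames = sorted(os.listdir('.'))

dfs = [make_df(filename) for filename in filenames]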

Now here’s the reducer. I actually end up wanting an inner join in the end, but the type of join is not important to illustrate the fact.

Here we join 5 DataFrames at once.

In [2]:
from functools import reduce  # a builtin in Python 2, an import in Python 3

def join_dfs(ldf, rdf):
    return ldf.join(rdf, how='inner')

final_df = reduce(join_dfs, dfs) #that's the magic
             Score_gdi  Rank_gdi  Score_gei  Rank_gei  Rank_sigi  Score_sigi  Rank_wdf  Score_wdf  Rank_wef  Score_wef
Nicaragua        0.912       102         74        37         53      0.8405        13   0.272727         6     0.7894
Rwanda           0.950        80         77        19         43      0.8661       134   0.096154         7     0.7854
Philippines      0.989        17         76        26         57      0.8235         6   0.322785         9     0.7814
Belgium          0.977        38         79        12          1      0.9984        73   0.163734        10     0.7809
Latvia           1.033        52         77        19         24      0.9489        82   0.157623        15     0.7691

I really like the elegance of this solution. I admit there may be other ways to go about it with pandas only, and I understand the R mentality of “no for loops”. Still this is precisely why I like pandas in python – you still get the freedom to play as you wish if it makes more sense to you.
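For anyone who wants to try the idea without CSV files on disk, here is a self-contained sketch. The two frames are rebuilt from the example tables earlier in this post; `tag_columns` and `join_all` are hypothetical helper names introduced for illustration, and the sketch assumes a pandas version where `DataFrame.rename` accepts a function for `columns`.

```python
from functools import reduce

import pandas as pd

# Values copied from the two example tables earlier in this post.
wigi = pd.DataFrame(
    {"Rank": [1, 2, 3, 4, 5],
     "Score": [0.356890, 0.347826, 0.345212, 0.343662, 0.342857]},
    index=["Republic of China", "Kingdom of Denmark", "Sweden",
           "South Korea", "Hong Kong"])
wef = pd.DataFrame(
    {"Rank": [1, 2, 3, 4, 5],
     "Score": [0.8594, 0.8453, 0.8374, 0.8165, 0.8025]},
    index=["Iceland", "Finland", "Norway", "Sweden", "Denmark"])


def tag_columns(df, name):
    """Return a copy whose columns carry the frame's name, e.g. Rank_wigi."""
    return df.rename(columns=lambda col: "{}_{}".format(col, name))


def join_all(frames, how="inner"):
    """Fold a list of DataFrames into one by joining on the index."""
    return reduce(lambda left, right: left.join(right, how=how), frames)


frames = [tag_columns(wigi, "wigi"), tag_columns(wef, "wef")]
inner = join_all(frames)               # rows present in every frame
outer = join_all(frames, how="outer")  # rows present in any frame
```

Because the columns are tagged before joining, no suffixes are needed however long the list of frames grows.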

Cyberwizard Institute: Retrospective


Cyber Wizard Institute

The Cyberwizard Institute  (CWI) was a free programming school based out of Sudo Room, running for the month of January 2015. The proclamation that I saw on their website before I volunteered to teach there was:

The idea is to be an anti-bootcamp. Anyone can participate. It’s free. We’re going to try hard to have lecture notes, assignments, and lecture livestreams up online. It will be primarily self-directed, but with guidance from higher level wizards.

As a founding member of sudoroom since 2011, but suffering from a recent malaise in my hacktivism, I found this the perfect project to reinvigorate my involvement. What most appealed to me was the idea of an anti-bootcamp, because I’ve wanted to make clear to the world the distinction I care about between start-up culture and technology. I wanted to do something metaphorically akin to hijacking the stereo system at a $4-coffee-wifi-shack and making a public service announcement that computers are not just fancy TVs, but programmable instruments of self-empowerment, which, in addition, can be used for non-commercial purposes.

Meeting Every Day

Without any formal advertising, each sudoer leading CWI was pleasantly surprised when 27 wizardlings showed up on the first day (14 women and 13 men by my count). When I remarked on this to CWI’s originator @marinakukso, she responded that “when you offer a free programming class, with no experience required – people want that”.

I recall some apprehension when we introduced ourselves, and there was the occasional naïve posturing  of people who claimed themselves as programmers with the phrase “I know HTML”. But the need to impress quickly disappeared as we sat down to struggle with them in installing Linux on the laptops they’d brought.

The next day I was nervous that I would arrive to an empty room, since all we had shown fresh minds was that computer programming was about inexplicable Ubuntu hurdles. Still, with only slightly leaky attendance, most wizards did come back for more. And we went right on with teaching them bash.

We continued to meet for 5 hours daily with lectures and hackerspace-esque hands-on floating help from higher level wizards, which we dubbed “social code”. Our rhythm was found quickly, and only half way through the month CWI was feeling so magical, it received coverage in the East Bay Express:

“Many coding bootcamps in the Bay Area charge tens of thousands of dollars in fees, which can be seen as restricting access to what has become essential for finding a job in technology, let alone moving up in Silicon Valley’s so-called “meritocracy.” Kukso explained that Cyber Wizard Institute’s mission is very much aligned with that of Sudo Room, which is to give everyday folks the opportunity to understand and create the technology in their lives. “For a lot people who consider themselves nontechnical,” Kukso said, “a lot things relating to technology or coding seem mystical or secret, our perspective is … everyone can learn these types of things.’

Pedagogical Questions

Yet towards the end, I started to question the effectiveness and importance of CWI. From the beginning, as facilitators we quipped that “anti-bootcamp” really meant “bootcamp”. And the calendar began by reflecting that.

  • Day 1: Install Linux
  • Day 2: Unix and Bash
  • Day 3: vim
  • Day 4: HTML
  • Day 5: javascript
  • Day 6: Networking
  • Day 7: Node.js
  • Day 8: Git
  • etc…

Which is exactly the way that substack, Oakland’s pre-eminent “unix philosopher,” would have it. Yet, that was before the collaborative aspects took over and I began to try and think about how I would teach a less trained non-programmer version of myself what I know now. I mixed in:

(click to view the recorded lectures)

Where substack was spreading his knowledge of artisanal web-buildery, I was attempting to proselytize a world of mathematical elegance. At times I was worried this felt interfering and competitive to the wizards.

However, the final projects did come to life, instigated solely by the intrinsic motivation of the new wizards. On the last day arduino hacks and personal-itch websites really had materialized. Those who made it all the way through the month spoke of a brighter perspective than my own: perhaps we had inadvertently succeeded at being an anti-bootcamp.

The Medium Was Always The Message

As another facilitator, @Johnnyscript, said at the closing Cyberpunk Masquerade Wizard Initiation Ceremony, we showed them what coding is actually like – many differently opinionated hackers running around without too much top-down organization. We delivered the essence of the hackerspace more accessibly than just happening upon a room of silent geeks staring down. Our package, despite being a bit dishevelled, did form a solid curriculum, although it was not as refined as something that you might pay $17,000 for. Yet it also was not an altar for silicon-valley start-up-ism.

Taken together, we find a point that I am surprised that I missed. Whereas  programming bootcamps are normally Cathedrals, as Eric Raymond might put it, we built a Bazaar.

Notconfusingly yours,

Your humble newb-druid.

Cyberwizard Institute II

“Will there be another Cyberwizard Institute?” many are asking. Likely, but it is as yet unplanned because volunteer work is tiring. If you have the initiative or want to hear about an initiative, join our discussion tracker on github.


Preliminary Results From WIGI, The Wikipedia Gender Inequality Index

This is a preliminary list of results from a research project that is being compiled into a full paper on the subject.

The full paper, in its academic form, is now available on arxiv.


WIGI is the Wikipedia Gender Inequality Index, a project whose purpose is to attempt to gain insight into the gender gap through understanding which humans are represented in Wikipedia. Professor Piotr Konieczny and I thought that, whereas some gender gap research focuses on the editors of Wikipedia directly, we would view the content and metadata of articles as a proxy measure for those editing. Although the notion of analysing Wikipedia content seems quite old, I believe the advent of Wikidata allows a new range of ambitious questions to be asked.


We use Wikidata, the new semantic database that feeds Wikipedia. By inspecting its weekly data dumps, we are able to inspect all the semantic properties associated with every Wikipedia page in any language, all at once. In this case we focus on any article that is about a person, and any data recorded for the properties gender, date of birth, date of death, place of birth, citizenship, and ethnic group (example). We do this courtesy of an excellent tool known as the Wikidata Toolkit.

We compare the found data to historical census data and the World Economic Forum’s Gender Gap Index.

For other computations we also supplement the original data with aggregation maps, made using Mechanical Turk, that derive cultures from place of birth, citizenship, and ethnic group.

This project has been conducted in an Open Notebook Science way, where we have been posting our results and receiving feedback as we work. You can chat with us on-wiki, or on-github where all the code and data needed to reproduce this research is available.

Let’s begin:

 Summary Statistics

As of October 14 2014 we inspected a total of 2,561,999 or about 2.5 million “human” items, that is any Wikidata item with the property “instance of: Q5 (human)”.
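To illustrate the “human” test, here is a minimal sketch in Python. We actually did this with the Wikidata Toolkit, so this is not our code; it assumes the JSON dump layout of one entity per line inside a single array, with the item reference stored under `numeric-id` and, in newer dumps, also under `id`. The sample lines are hand-written for the example.

```python
import json


def is_human(entity):
    """True if the entity has an 'instance of' (P31) claim pointing at Q5."""
    for claim in entity.get("claims", {}).get("P31", []):
        value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and (
                value.get("id") == "Q5" or value.get("numeric-id") == 5):
            return True
    return False


def count_humans(lines):
    """Count human items in an iterable of dump lines (one entity per line)."""
    total = 0
    for line in lines:
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # array brackets and blank lines carry no entity
        if is_human(json.loads(line)):
            total += 1
    return total


# Hand-written, heavily trimmed sample: Q42 is a person, Q64 is a city.
sample_lines = [
    '[',
    '{"id": "Q42", "claims": {"P31": [{"mainsnak": {"datavalue": '
    '{"value": {"numeric-id": 5, "id": "Q5"}}}}]}},',
    '{"id": "Q64", "claims": {"P31": [{"mainsnak": {"datavalue": '
    '{"value": {"numeric-id": 515, "id": "Q515"}}}}]}}',
    ']',
]
humans = count_humans(sample_lines)
```

Reading the dump line by line keeps memory use flat, which matters for a file holding millions of items.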

On each of those items we looked for the following additional properties, and found them on the following numbers of items.

Property                % of total  Items with property
ethnic group                  0.30                7,772
country*                     23.47              601,361
place of birth               23.93              613,092
date of death                28.79              737,522
citizenship                  41.44            1,061,634
culture**                    45.20            1,158,086
date of birth                57.92            1,484,003
gender                       89.40            2,290,433
at least one site link       99.05            2,537,545
a “Q” ID                    100.00            2,561,999


*country is determined by checking whether the place of birth is a country or, if it is a city, whether the city has a country property.

**culture is determined by translating ethnic group, place of birth, and citizenship into 1 of 9 world cultures as per the Inglehart-Welzel map of the world, with Mechanical Turk. Then we take the consensus of the three aggregated variables. (Actually there were no disagreements between the three variables.) All aggregation maps are available for inspection on github.
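The coverage figures in the table above come down to a counting pass over the items. Here is a minimal sketch, assuming each item has already been reduced to a dict mapping property names to values. The function name `property_coverage` and the toy `items` list are hypothetical, made up for illustration, and are not taken from the project's code.

```python
from collections import Counter


def property_coverage(items, properties):
    """Return {property: (count, percent_of_total)} for a list of item dicts."""
    counts = Counter()
    for item in items:
        for prop in properties:
            # A property counts as present only if it has a non-empty value.
            if item.get(prop):
                counts[prop] += 1
    total = len(items)
    return {
        prop: (counts[prop], 100.0 * counts[prop] / total if total else 0.0)
        for prop in properties
    }


# Hypothetical toy data, not real Wikidata records.
items = [
    {"gender": "female", "date of birth": 1815},
    {"gender": "male"},
    {"date of birth": 1900, "citizenship": "Q30"},
    {},
]
coverage = property_coverage(items, ["gender", "date of birth", "citizenship"])
```

Each value in `coverage` pairs the raw count with the share of all items, which is the same shape as the two numeric columns of the table.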

Now the first derived and naive statistic of interest – the total gender breakdown. As we’ve seen above 10.3% is of unknown gender, otherwise we encounter 75.7% male, 13.9% female, and <.01% nonbinary which is perhaps better described as 152 cases.

Sanity Checking With Historical Data

We want some sanity checking that the data from Wikidata reflects the world at large. To do this we compared our total population per year, calculated by date of birth, versus the world population.

Comparing the Wikidata data to historical census data, we find a highly significant correlation in total population – Pearson correlation coefficient = .983 with p<0.01. This lends some credence to the notion that this dataset reflects the world at large. (By the way, the historical data extends backwards to 10,000 BCE, but the earliest date of birth in Wikidata is about 4,000 BCE.)
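For readers who want to see what that coefficient measures, here is a minimal sketch of the Pearson correlation in plain Python. The two short series are hypothetical stand-ins for "biographies per year" and "world population per year", not our data, and the sketch computes only the coefficient, not the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, at least 2 long")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, all left unnormalised because the
    # normalising factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy series, for illustration only.
biographies_per_year = [10, 20, 30, 45]
world_population = [100, 210, 290, 460]
r = pearson_r(biographies_per_year, world_population)
```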

Total Biographies Over Time

These graphs show the absolute volume of items by date of birth and death by gender, over all time and from 1800 onwards.

This first visualization of the gender gap shows how Wikipedia’s retroactive focus on history has been consistent in its bias in representing females. It’s also generally quite a smooth curve, save for some noticeable spikes around World War I and II.

It’s intriguing to contemplate how we might expect date of birth and death to be related. If they were equally well recorded – and barring extreme events like wars – the death curve would look like a right-shifted birth curve. However, we see empirically that this is not true. At all times the death curve remains absolutely smaller than the birth curve, by a factor of about two-thirds. So we can see a bias in recording the date of birth more often than the date of death.

Gender Ratios Over Time

The indication of visual skew in gender prodded me to look at how the ratios of male, female, and nonbinary genders develop over time.

Note: From here I aggregate the nonbinary genders into a single class, not for philosophical reasons, but for ease of visualising the many dimensions they represent. I consider it important to be descriptive about what is found in the data, and not to lose any perspective because of personal assumptions about gender. If you think there are better ways to describe this data, I would be glad to hear from you.

We adjust our viewing window to start at 1400 CE because the data before then is too sparse to provide a meaningful picture.


Curiously, from about 1800 to the present, the female ratio of biographies is greater when using the date of birth measure than the date of death measure. What is the interpretation? Somehow recording female date of birth is more prominent in a way that recording date of death isn’t. Although both ratios are rising, somehow date of birth is outstripping date of death. It would be great to investigate how much this is owed to recording practices and how much it is owed to social phenomena.

Notice that after about 1990 the spike is very large, and even crosses 50%. This is more statistical anomaly than anything else, since the number of humans with a date of birth after about 1990 is very small, as you can see in the volumes plot. There are only 12,000 entries with date of birth in 1990 and only 199 biographies born in the year 2000. Even discounting the very recent trends of the last 20 years, which describe humans that are just entering adulthood or younger, the female ratio is rising exponentially. I was expecting to fit a logistic curve to the female percentage so that we could predict when we might reach parity, however that notion does not make sense with what is being shown. Although it may not necessarily indicate equity, by fitting an exponential model to this percentage we can calculate when the female percentage would reach 50%. By our calculations the exponential extrapolation would reach 50% female representation in February 2034. But of course predicting the growth of percentages can lead to nonsensical results (as humorously shown in this xkcd comic). I suspect we will see a logistic model, but simply haven’t encountered the inflection point of slowing growth yet.
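To make the extrapolation step concrete, here is a minimal sketch of one common way to do it: fit a straight line to the logarithm of the share by least squares, then solve for the year at which the fitted curve hits 50%. This is an assumption about the general technique, not a copy of our notebook code, and the `years` and `shares` values are hypothetical, so the result does not reproduce the February 2034 figure.

```python
import math


def fit_exponential(years, shares):
    """Least-squares fit of shares ~ exp(a + b * year), done on the log scale."""
    logs = [math.log(s) for s in shares]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx            # growth rate per year on the log scale
    a = mean_y - b * mean_x  # intercept on the log scale
    return a, b


def year_reaching(target_share, a, b):
    """Year at which exp(a + b * year) equals target_share."""
    return (math.log(target_share) - a) / b


# Hypothetical series: a share that doubles every 50 years.
years = [1850, 1900, 1950]
shares = [0.05, 0.10, 0.20]
a, b = fit_exponential(years, shares)
parity_year = year_reaching(0.5, a, b)
```

Note that nothing in the exponential model stops it at 100%, which is exactly why this kind of extrapolation should be read with suspicion.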

Aggregating Cultures

First some caveats as to method which we use in the next section.

  • There is no good way to aggregate cultures perfectly. Aggregation in general assumes some loss of fidelity. The point in doing so is to gain a broader-stroke picture, and in this case simplify visualizations.
  • The method we used for aggregation – starting from the Inglehart-Welzel map of the world (right), and then “mechanical turking” in the rest of the values – comes loaded with its own cultural baggage and perspective.
    By DancingPhilosopher [CC BY-SA 3.0 (], via Wikimedia Commons
  • The Inglehart-Welzel map only really makes sense for modern geopolitical boundaries. For instance, the notion of having a Protestant and Catholic world before Protestantism and Catholicism does not make sense. We use those soft modern boundaries superimposed over the geographical region to determine historical values. So if you were born in ancient Greece, you are known as Orthodox in this method.
  • Some ethnicities were a mixture of two cultures, like “Thai-American”. In those cases we took the modifier, so we’d use “Thai” -> South Asian. There are two ways to do this and neither is very good, so we make a compromise to get a rough picture. The full data is available for more munging if you would like to fine-tune it.
  • We aggregate the 9 cultures from 3 similar but different Wikidata properties: citizenship, ethnic group, and place of birth. Since each of those is a different concept, a conflict may arise. However, in this research we did not find a case where different property aggregations gave different world cultures.
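The last two caveats can be sketched in code. This is a minimal illustration, assuming one lookup table per property. The function names and the tiny `maps` dictionary are hypothetical (the real aggregation maps are the ones on github), and only the “Thai” -> South Asian and ancient Greece -> Orthodox pairings come from the text above.

```python
def modifier_of(ethnic_group):
    """For a mixed label such as 'Thai-American', keep the modifier ('Thai')."""
    return ethnic_group.split("-", 1)[0]


def culture_of(item, maps):
    """Aggregate one item to a world culture, or None if nothing maps.

    Raises ValueError if the properties disagree, since the sketch has no
    rule for choosing between conflicting cultures.
    """
    found = set()
    for prop, mapping in maps.items():
        value = item.get(prop)
        if value is None:
            continue
        if prop == "ethnic group":
            value = modifier_of(value)
        culture = mapping.get(value)
        if culture is not None:
            found.add(culture)
    if not found:
        return None
    if len(found) > 1:
        raise ValueError("properties map to different cultures: %s" % sorted(found))
    return found.pop()


# Hypothetical miniature aggregation maps, for illustration only.
maps = {
    "ethnic group": {"Thai": "South Asian"},
    "citizenship": {"Thailand": "South Asian", "Greece": "Orthodox"},
    "place of birth": {"Bangkok": "South Asian", "Athens": "Orthodox"},
}
example = culture_of({"ethnic group": "Thai-American", "place of birth": "Bangkok"}, maps)
```

Raising on a conflict, rather than silently picking one property, is a design choice of the sketch. It makes it easy to confirm a claim like "we found no disagreements" by simply running over all items.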

Gender Ratios By Culture

We make a cross-tabulation of gender by culture. A chi-squared test shows the observed distributions of gender by culture to differ significantly. We now graph the female percentage of biographies by culture.
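As a reminder of what the chi-squared test compares, here is a minimal sketch that computes the test statistic and degrees of freedom for a cross-tabulation in plain Python. The tables in the usage lines are hypothetical, and turning the statistic into a p-value needs a chi-squared distribution function from a statistics library, which this sketch deliberately leaves out.

```python
def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed counts, e.g. one row per gender
    and one column per culture.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical 2x2 tables: one skewed, one perfectly proportional.
skewed_stat, skewed_dof = chi_squared([[10, 20], [20, 10]])
flat_stat, flat_dof = chi_squared([[10, 20], [20, 40]])
```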


More than anything, I think what astounds most is the large difference in the absolute number of biographies by culture. The European and English-speaking world dominates by a large amount here. Although, it might be that European and English-speaking biographies are simply more likely to be described in Wikidata at the moment, by some sort of quirk of the volunteer import process. Later we’ll see how that affects German and Austrian items.

If we do inspect the female percentages as-is, we find a very high showing for the Confucian culture. After talking to some Confucian-world Wikipedians on twitter (who I can’t find now to credit) and fellow Wikipedia Researcher Hai-Yi Zhu from University of Minnesota, we produced the hypothesis that this is because the phenomenon of celebrity is larger in those cultures, and celebrity is more evenly gender-distributed. We will investigate the celebrity hypothesis in a bit. If you have another hypothesis, we welcome your input for testing.

We provide the graph of nonbinary percentages of biographies by culture too. The cultures are ordered in the same way as the female graph for ease of comparison. Notice that the ordering is relatively similar to the female graph – so on the surface, recording female biographies is linked to recording nonbinary genders too.


 Gender Ratios Over Time

Let’s mix all these variables now, by viewing the culture ratio trends over time. To note our sample size as we continue, only 951,101 or about 35% of total records have all of date of birth, culture, and gender data.



You can see that the recent past, around 1800, is a low point for female recognition in all cultures relative to most of the past 3 millennia. Likewise it is visually evident that historical trends in different cultures have, while not reaching 50%, peaked at much higher percentages. In the modern historical graph, we can see a rise occurring for all cultures, and super-linear growth even for the Confucian and South Asian countries. The sky-rocketing ratios after 1990 are less significant, as noted above.

Gender by Wikipedia Language

Now let us recall that there is one more dimension we have recorded, the sitelink dimension, which indicates whether or not for an item a Wikipedia language has an entry for it. To be clear, say for instance that Finnish Wikipedia has an article about a Japanese human; we would be commenting on Finnish Wikipedia. With this data we can analyse the female and nonbinary tendencies of a Language, not a nationality or culture.

Here we have plots that show the relative frequencies of female articles per Wikipedia language, versus the size of the language.


And again for nonbinary humans.


Notice that in general there is no simple trend linking Wikipedia size to female representation. The visual technique I use here is to look for the points whose magnitude from the origin is greatest. Mostly I see a relatively flat, constant rate, with a few Wikipedias standing out a bit, like the Japanese, Chinese and Tagalog. So again we are seeing some evidence for Confucian and South Asian cultures being less gender biased when following the sitelink method of analysis.

Gender by Aggregated Wikipedia Language

To shore up the idea of cultural influence in the sitelinks analysis, we aggregate the languages into the nine World Cultures as before. In this case, since there are only about 280 languages, I assigned all of the languages by hand, rather than resorting to Mechanical Turk.


To clarify, the technique used here is that every Wikidata item counts towards a culture if a sitelink exists in at least one language associated with that culture. So if an article has language links to the English, Chinese, and Japanese Wikipedias, that item counts only once towards each of the English-speaking and Confucian categories.
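The counting rule can be sketched as follows. This is a minimal illustration in which each item is represented only by its list of sitelink codes. The function name and the two toy items are hypothetical, and the three language assignments shown are the ones from the example in the paragraph above.

```python
from collections import Counter


def count_items_by_culture(items_sitelinks, language_to_culture):
    """Count items per culture, each item at most once per culture."""
    counts = Counter()
    for sitelinks in items_sitelinks:
        # Collapse the item's languages to a set of cultures first, so that
        # two Confucian-language sitelinks still add only one to "Confucian".
        cultures = {
            language_to_culture[lang]
            for lang in sitelinks
            if lang in language_to_culture
        }
        counts.update(cultures)
    return counts


language_to_culture = {
    "enwiki": "English-speaking",
    "zhwiki": "Confucian",
    "jawiki": "Confucian",
}
# Hypothetical items: one with three sitelinks, one with a single sitelink.
toy_items = [["enwiki", "zhwiki", "jawiki"], ["jawiki"]]
culture_counts = count_items_by_culture(toy_items, language_to_culture)
```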

Now we have a more coherent picture about which types of Wikipedias by language are focusing on female articles. And we do continue to see a high Confucian showing.

Let us test our celebrity hypothesis. For the Chinese, Japanese, Korean, Tagalog, Urdu, German and English Wikipedias, we retrieved the page content of each Biography from 1930 until 1989 (recall that there are very few Biographies with date of birth 1990 and higher).

We search for the English or foreign language words that are associated with celebrity. The dictionary used is:

{'jawiki': [u'俳優', u'選手', u'歌手', u'ミュージシャン', u'モデル', u'アイドル'],
 'zhwiki': [u'演員', u'運動員', u'歌手', u'音乐家', u'模特兒', u'偶像'],
 'kowiki': [u'배우', u'선수', u'가수', u'음악가', u'모델', u'우상'],
 'tlwiki': [u'artista', u'aktor', u'player', u'mang-aawit', u'musikero', u'modelo', u'idolo'],
 'urwiki': [u'اردو', u'کھلاڑ', u'گلوکار', u'موسیقار', u'ماڈل', u'بت'],
 'dewiki': [u'schauspieler', u'spieler', u'Musiker', u'Sänger', u'Modell', u'Idol'],
 'enwiki': [u'actor', u'actress', u'player', u'singer', u'musician', u'model', u'idol']}

If you can provide better translations than Google’s software, let me know. We consider a celebrity to be a biography that contains one of the above words within the first 200 characters of its Wikipedia entry.
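The celebrity rule is simple enough to sketch in a few lines. The function name `is_celebrity` is hypothetical, and the case-insensitive substring match is an assumption of this sketch (the dictionary above mixes upper and lower case, so exact matching would behave differently).

```python
def is_celebrity(page_text, keywords, window=200):
    """True if any keyword occurs within the first `window` characters."""
    opening = page_text[:window].lower()
    # Plain substring matching: 'model' would also match inside 'remodel'.
    return any(keyword.lower() in opening for keyword in keywords)


enwiki_keywords = [u'actor', u'actress', u'player', u'singer',
                   u'musician', u'model', u'idol']

# Hypothetical page openings, for illustration only.
early_mention = u"Jane Doe (born 1950) is an American singer and songwriter."
late_mention = u"x" * 200 + u" later became an actor."
```

With these inputs, `is_celebrity(early_mention, enwiki_keywords)` is true, while `late_mention` fails because its keyword falls outside the 200-character window.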

Then we make a heatmap comparing the language, the decade, the gender, and the celebrity percentage.


Using visual inspection, at first glance we can see that the female matrix is darker in general than the other two matrices. So recorded females are more likely to be celebrities among these languages.

Likewise you can see that in general the heatmap transitions to being darker at the top than bottom, so we have shifted to being more celebrity conscious in most languages in recent years.

Lastly we see some vertical-striped features showing that for instance Tagalog is prone to being celebrity conscious across gender and time.

To determine the significance of the effects we perform a logistic regression analysis predicting the celebrity percentage variable. The coefficient matrix is printed below.

               coef   std err        z   P>|z|   [95.0% Conf. Int.]
decade       0.0236     0.013    1.823   0.068    -0.002      0.049
enwiki       0.0509     0.875    0.058   0.954    -1.664      1.766
jawiki       0.7763     0.837    0.927   0.354    -0.865      2.418
kowiki       1.3834     0.832    1.662   0.097    -0.248      3.015
tlwiki       3.0009     0.945    3.176   0.001     1.149      4.853
urwiki       0.8901     0.869    1.025   0.306    -0.813      2.593
zhwiki       0.5383     0.846    0.637   0.524    -1.119      2.196
female       1.3580     0.453    2.999   0.003     0.471      2.245
intercept  -47.9056    25.368   -1.888   0.059   -97.626      1.815


Depending on which arbitrary significance threshold you choose to use, we find different answers, but at least the female and Tagalog variables are significant with p<0.05. If we loosen the significance threshold slightly, decade and Korean also become predictors. This lends a lot of credence to the notion that in the cases in which women are recorded in Wikipedias, they have a strong tendency to be celebrities.
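For readers less used to logistic regression output, here is a minimal sketch of how to read the coefficients. It assumes a standard logit model, in which a coefficient is a change in log-odds, so exponentiating it gives an odds ratio. It also rebuilds the interval as coefficient plus or minus 1.96 standard errors, which agrees with the last two columns of the table up to rounding. The function names are hypothetical. The numbers are copied from the female and tlwiki rows.

```python
import math


def odds_ratio(coef):
    """Multiplicative change in the odds for a one-unit increase."""
    return math.exp(coef)


def wald_interval(coef, std_err, z=1.96):
    """Approximate 95% interval: coefficient +/- z standard errors."""
    return coef - z * std_err, coef + z * std_err


# (coef, std err) pairs taken from the coefficient table.
rows = {"female": (1.3580, 0.453), "tlwiki": (3.0009, 0.945)}

female_low, female_high = wald_interval(*rows["female"])
female_odds_ratio = odds_ratio(rows["female"][0])
```

An interval that excludes zero on the log-odds scale corresponds to an odds ratio interval that excludes 1, which is another way of stating the p<0.05 result.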

Connections To The World Economic Forum Index

Indexes are useful, but they are more useful as a group of compatible and comparable indexes. We compared our place of birth and citizenship data as it related to gender, to the World Economic Forum Gender Gap Index. The World Economic Forum uses its own methodology to produce a scalar value on the interval (0,1) to rank the gender equality of a country. To match to that format, we take the Wikidata data in the form of female composition of biographies by country.

We performed a calibration step to see which time window of data would make our ranking of countries correlate most closely with that of the World Economic Forum. If the Wikidata dataset is used with the time window only considering the biographies with date of birth between 1890 and 1990, the Spearman rank correlation is 0.31 with a p value of 0.03. That means that there is some grounding for accepting the female composition of Wikidata items of humans associated with a country as an inequality index, because it is significantly correlated with other respected inequality indexes.
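Spearman's coefficient is the Pearson correlation computed on ranks instead of raw scores, which is why it suits comparing two country rankings. Here is a minimal plain-Python sketch with hypothetical function names. The usage lines feed it only the ten WEF-top-10 rows shown in the first table below, so the value it gives is not the 0.31 reported for the full set of countries, and no p-value is computed.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# WEF and Wikipedia scores for the ten countries in the WEF top 10 table.
wef_scores = [0.8594, 0.8453, 0.8374, 0.8165, 0.8025,
              0.7894, 0.7854, 0.7850, 0.7814, 0.7809]
wiki_scores = [0.1895, 0.1807, 0.2142, 0.3452, 0.2149,
               0.2727, 0.0962, 0.1586, 0.3228, 0.1637]
rho_top10 = spearman_rho(wef_scores, wiki_scores)
```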

Here is a sample of the two rankings side-by-side. We display the top 10 as per the World Economic Forum rank, and then the top 10 as per the Wikipedia Rank. You’ll also see the associated WIGI rank, the raw scores for each, and the difference in the ranking.
World Economic Forum Top 10

Country       WEF Rank   Wikipedia Rank   WEF Score   Wikipedia Score   Rank Difference
Iceland              1               30      0.8594            0.1895               -29
Finland              2               39      0.8453            0.1807               -37
Norway               3               22      0.8374            0.2142               -19
Sweden               4                1      0.8165            0.3452                 3
Denmark              5               20      0.8025            0.2149               -15
Nicaragua            6                9      0.7894            0.2727                -3
Rwanda               7              108      0.7854            0.0962              -101
Ireland              8               64      0.7850            0.1586               -56
Philippines          9                3      0.7814            0.3228                 6
Belgium             10               58      0.7809            0.1637               -48


WIGI Top 10

Country                      WEF Rank   Wikipedia Rank   WEF Score   Wikipedia Score   Rank Difference
Sweden                              4                1      0.8165            0.3452                 3
South Korea                       117                2      0.6403            0.3437               115
Philippines                         9                3      0.7814            0.3228                 6
Bahrain                           124                4      0.6261            0.3171               120
Mauritius                         106                5      0.6541            0.2941               101
People’s Republic of China         87                6      0.6830            0.2812                81
Australia                          24                7      0.7409            0.2760                17
Japan                             104                8      0.6584            0.2732                96
Nicaragua                           6                9      0.7894            0.2727                -3
Swaziland                          92               10      0.6772            0.2593                82

We see that the rankings bear some similarity, but that the correlation is mild. Still, we can take away that what the WEF is driving at with its measure, and the number of female biographies that exist about humans in a country, are somewhat related ideas.

Data Reliability

The question of how accurately Wikidata reflects all Wikipedias is important to settle before addressing the question of how well Wikipedias reflect the world at large.

During our research, we found a curious quirk in the way that nationality is recorded, and the story is instructive in showing that Wikidata still has a few artefacts of its bot-imported nature. A more in-depth analysis is available in a post I previously wrote, “Wikidata and the Measure of Nationality”.

In short, the idea centres around an early finding which indicated that Protestant European humans seemed to disappear in the 1930s when we were determining culture using just the “Place of Birth” property. It looked like this:



This is what led us to investigate how nationalities were being classified on Wikidata. The next graphs show which humans have which classification method – by place of birth, citizenship, or both – for nationality. For Germanic humans we saw a large shift:


And for all other populations we witness no such thing:


After publishing these findings, a Wikimedian wrote in to explain that the import of Germanic human data into Wikidata occurred through a bot called “FischBot”, and that the shift is likely only related to the way that that software operated. The moral is that we should remain vigilant about the data quality in Wikidata.


It is not my intention to draw any large scale conclusions at the moment. For that I will wait until the publication of the paper for which this analysis is intended. Still I would be glad to hear any insights you might see until then.


We finished writing the paper. An excerpt from the conclusion there:

Our research confirms that gender inequality is a phenomenon with a long history, but whose patterns can be analyzed and quantified on a larger scale than previously thought possible. Through the use of Inglehart-Welzel cultural clusters, we show that gender inequality can be analyzed with regards to world’s cultures. In the dimension studied (coverage of females and other genders in reference works) we show a steadily improving trend, though one with aspects that deserve careful follow up analysis (such as the surprisingly high ranking of the Confucian and South Asian clusters).


Tweet at me @notconfusing .


I programmed all of this research using the IPython notebook, and it’s all entirely open source and hopefully reproducible from

I plan to start parsing and filtering Wikidata monthly to provide updated data, which should be coming soon.


The best part of soup is that soup doesn’t have parts.

This is a piece I wrote for Bulbes, a zine about soup.
The concept of “smooth” versus “striated” spaces (striated means lined or striped) by Deleuze and Guattari is a stream of philosophy with a missing interpretation through soup.
Depending on your choice of violent conflict, the smooth/striated dichotomy has popularly been depicted in the film “Die Hard” and the 2002 Israeli Operation in Nablus. In those scenarios parties moved not through the striated spaces of elevators, stairs, streets and squares, but through smooth and direct routes like air ducts and holes cut in walls.
I prefer a more pastoral example to grasp the notion of seeing a smooth space under a striated one. Northern Californian valleys are, by default, quite literally a smooth space with rolling hills. A striated framework is imposed on top of them in the form of designated walking paths. This New Year’s Day I was hiking there with some friends when our crew realized that the right side of the bifurcated path we took was a dead-end about 10 metres in. Some walked backwards along the right path to the junction point to continue down the left path, but my girlfriend Adi walked directly across the imaginary hypotenuse, thereby stepping off any of the paths. That is seeing the smooth under the striated.

Soups are the ultimate smooth in the foodspace. You can eat any part of the soup, at any time, by gracefully carving through the three-dimensional liquid with your spoon. The same cannot be said for a pie, sweet or savoury, as it has an inside and an outside, and a top and a bottom layer. Because of this you must eat its parts in an order, and that order may come with a value judgement. Likewise a burrito in foil is not smooth; it is a striated space of one stripe – you are supposed to eat it front to back in linear order. Even a combination of potatoes and peas on a plate must cross the threshold of your lips in some discrete relative positional arrangement. That is not true for the soup group: a hummus, a miso, or a lentil stew. You can eat those in any order – or rather there’s no order to speak of. There being no order frees us from there being judgement of how we ate, and eschews a level of social etiquette.

Perhaps one of the great social orders and hierarchies that exist, at least in the Western world today, is in the relationship space. The way I have been taught to view relationships in my life, largely donated by the patriarchy, is striated. Here is a map of the mental conception of the pathways of relationships for me. I wonder how true it is for you?

Striated Relationship Space

It does not look like a soup. It is much more like a high road or a main drag, with a few side alleys you might find yourself unlucky enough to fall into. Relationships, says the striated-theory, are places to go with junctions and turns to take.

But that’s just one way to view it. In fact, in a conversation I was having with another partner last year, we decided to start having less sex. Quite easily we could slip into viewing this as us being “Friends with Benefits” or “non-exclusive partners” now moving into the “Friend zone”. Let’s resist the temptation to impose striation. Let’s think about it as soup. The relationship space is greatly simplified when soupified in smooth-theory.

Smooth Relationship Space
Updated: A second way to think about a soupier model without an explicit time axis, although a curve is drawn inside the space to demonstrate how a relationship might evolve over time.


(Notes: the image I have drawn here is not accurate for the relationship I described in the last paragraph. Additionally, if anyone wants to take a stab at redrawing the smooth relationship space without a time axis, I would be glad to post alternate imaginings.)

In the relationships-as-soup-iverse, that conversation can be viewed as a modification of the composition of soup. The soup still exists, it might taste different, it might be more or less substantive, but it is still soup. Transferring the notion that soup has no order and carries no judgement, likewise the relationship is never eaten, or slurped wrongly. You cannot break etiquette or convention by having traversed the relationship roads poorly.

When I found this viewpoint I was relieved because I was having fears about whether I’d broken her expectations, or upset the social order between our mutual friends. When I switched from viewing relationships as a striation of discrete statuses, to a smooth soup of two people, I was liberated in a small way. Liberated because the framework had no room for our mutually-agreed feelings to be against-the-grain.

Perhaps relationships, like soups, don’t have distinguishable parts that can be assembled in the wrong order. Let your soup act as a tiny reminder to see the smooth under the striated.


I received this Facebook comment, which I found instructive on how striated relationship space can frustrate those who aren’t willing to submit to a complete striation.


  • Dee Coetzee The first diagram I’ve heard called the relationship escalator. It’s problematic not only because of its required linear ordering, but because you are required to continue making your way up the escalator, or else get off altogether. There is no room for contented plateaus where nobody has any interest in anything other than what they already have. The escalator also contributes to the frustrating lives of asexual people – a lot of people aren’t willing to accept someone in a romantic relationship (much less a marriage) who doesn’t have sex, because sex is lower down on the escalator.

    I think the soup model actually works pretty well because the soup can contain any mixture of sexual, romantic, familial, or friendship elements, and you can stir in new ingredients over time, or just keep enjoying what you’re eating now, as you prefer. (It is kind of hard to draw though – it feels like a high-dimensional space more than anything.)


Omni Radio 1 – What is the Omni Commons?

In this inaugural Omni Radio podcast, Kwe and I introduce some of the collectives of the Omni, its purpose and its position in Oakland’s current landscape. To find out more visit .

The interviewees in this podcast are Cere from Counter Culture Labs, Jesse from Food Not Bombs, Margareta and Andrew from Backspace Wellness, and Jenny from Sudo Room.