Personal Statement: In Full

The “Personal Statement” for the graduate school application is an attempt to explain how you will make a difference, not just in your research, but in making the University as an organism more equitable.

Luckily, my proposed research is precisely about the kinds of social division that the instructions to this document ask you to address:

Please describe how your personal background and experiences inform your decision to pursue a graduate degree.
In this section, you may also include any relevant information on how you have overcome barriers to access higher education, evidence of how you have come to understand the barriers faced by others, evidence of your academic service to advance equitable access to higher education for women, racial minorities, and individuals from other groups that have been historically underrepresented in higher education, evidence of your research focusing on underserved populations or related issues of inequality, or evidence of your leadership among such groups.

Each university wants a different length of personal statement (UC Berkeley gives no specification, and the University of Minnesota asks for a page). So I’ve drafted my most complete thoughts, and am posting them here before taking the editorial chainsaw to them. Plus I’m including the figure that I made from my transcripts, which helped me think about disadvantages.

Personal Statement for Max Klein

This last month I attended my first ‘ecstatic dance’ event. I started to play a game with myself where I identified each unspoken assumption I was making about dancing, and tried to break it. I had to dance in one location, to not repeat a move too often, to stand, to take it seriously, to be energetic, to not have pauses, to synchronise it to the music. It’s a fantastic game to play, because the size of the list that you can make is astounding – and it’s never exhaustive. Like dancing constraints, the list of prejudices that affects who enters and succeeds in higher education is long and never fully enumerated. And like dance, higher education is shaped by its biases.

My discovery of the reality of social biases was a turning point in my life. The first time I awoke to my internalised racism, a mix of discomfort and amazement overcame me. It was during a protest on the famous steps of Sproul Hall. After a few rabble-rousing speeches a black woman came to the stage and started delivering activist poetry. The poem had themes of being a strong, earthy, celestial woman and was sung with jazz-swing. I grew bored and criticised the performance. And then I received a moment of self-awareness wherein I saw that I was dismissing the content because of the delivery. I remember feeling compelled to physically run away from the event, but running was futile because I was actually trying to run from myself. The entire night I sat in tearful bewilderment. I had been brought up to think of myself as not racist, and yet incontrovertibly I had just been racist. It was a painful moment that marked the beginning of my realising I could have a lot of prejudices I wasn’t aware of.

After the illusion of a just world was shattered, my subsequent prejudices came to light more rapidly. My own misogyny became very real upon my reading of favourite academic Joseph Reagle’s Free as in Sexist deconstruction of sexism in Open Culture. (Of course it took a man to show me that.) I lost my religious dogma at the holocaust memorials in Berlin and Auschwitz, when I saw that accepting unquestioned messages was dangerous – including the ones from my Jewish family. Data showed me my homophobia when I made a spreadsheet of romantic partners grouped by whether or not I met them travelling, and then tried to explain why I was less straight when touring. Only last month the perusal of a blog of a woman I met at a wedding introduced me to the ‘fat stigma’ I had been harbouring. The continuous waves of realisation perpetuate my wonder at just how many unidentified stigmas I’m still holding.

The feeling I get from a solid attack on my belief-system is so powerful that chasing after it has become the driving force in my life. No wonder it’s precisely what my proposed research agenda is about. The proposal is to make an algorithmic social-bias-detector to run on crowdsourced databases. To think of it in concrete terms, let’s consider some observational research I hacked together and posted on my blog in 2013. It meant that I was the first person to see the proportion of female biographies across all Wikipedia languages. They ranged from 8.83% (Slovenian Wikipedia) to 19.97% (Serbian Wikipedia), with English Wikipedia at 14.21%. There was a sinking feeling in seeing these paltry numbers, but also one of optimism at having performed a quantification that can aid the correction of them. Despite how complex that moment was, it definitely convinced me to double down on this line of research. So I’ve proposed to mine all the properties in Wikipedia (or Freebase etc.), like profession, or properties that aren’t even about people, like the age of a concept, and see which fit into our model of what a bias looks like.

Through this detector I am trying to arm underserved academics with data. The aim is to provide a series of inequality indexes like the United Nations Gender Inequality Index. In fact I have already built a prototype that compares gender, dates of birth, places of birth and ethnicity. This makes the inequality data scene richer because it comes from the online Open Data landscape rather than from opaque surveys. This will be valuable for academics currently researching inequality and make it much easier for other researchers to include inequality as a dimension in their projects. Moreover, the astoundingly unequal results of the internet’s own inequality index will generate lots of mainstream awareness of digital social biases, which will lend credence to the access issue.

The second half of my continuing research, the search for unidentified biases, will address the “other individuals” of the “women, racial minorities, and individuals from other groups that have been historically underrepresented in higher education”. First let’s smash a binary. To be underrepresented in higher education is not a binary relationship. A person has many facets, and each facet has a different representation in higher education. Sometimes it seems like being underrepresented is reserved only for women and racial minorities; but I think that a big lesson, and a nod to the work done by those coming from those backgrounds, is to acknowledge a spectrum of underrepresentedness. Therefore as my project attempts to put its finger on many more social biases I hope everybody will come to see a part of themselves that is underrepresented. That will be very powerful, I claim, because once all people see themselves on that spectrum of underrepresentedness, it is easier to see the validity of the challenges of groups like women and racial minorities, since they are not in a different class of challenges but rather the same type of challenge at a different intensity. Essentially, viewing yourself as a minority to any degree highlights the importance of good allyship society-wide.

When I insert myself into the spectrum of underrepresentedness the main challenge that comes to the surface is the lack of formal education in my immigrant family. As I went over my transcript to submit in this application, the arching narrative it told made obvious the difficulties of having no parents that had been to college before me.

At first after high school I didn’t go to college at all; nobody suggested or pressured me to, I was simply left to my own devices. It took some years until the social stigma (here’s another one) of having bad grades and low education frustrated me, and so I began junior college with a man-on-fire attitude. Bolstered by the successful waypoint of transferring, the educational passion continued. Yet towards the end of my college career, I felt I wasn’t working towards a goal anymore as I had been, and my motivation slipped into a depression. This was the challenge, because my family could not advise me on why to sprint to the end of that degree. I overcame that challenge by redirecting my energy into real-world endeavours, which is when my Wikipedia volunteerism first blossomed – and it serves as the basis for my research today. Even more, I’ve learned from that mistake by preplanning for the next depressive phase in graduate study when I inevitably encounter it, so instead of deflection I can greet the uncharted feelings with more reaching out, both to advisors and my past self.

I recall at that time of difficulty for me how my friends could phone their degree-holding family members, and receive a boost over the shaky potholes so that they never amassed into a derelict street and their vehicle never totally fell apart. It’s really a class difference, when you imagine the family as an organism – like how an idea is an organism in memetics. It’s easier for the family to repeat behaviour rather than create new pathways. Having this handhold of minority-ness in education has, I believe, allowed me to see a galaxy of discrimination, and to become a leader in changing it.

It has actually been in educational access that I’ve been a leader in change. After avenging the stigmatic education perceptions with my college degree, I had all desire to keep on learning, and none to impress anyone. In the winter after college I founded Sudo Room “hackerspace and creative community” along with 23 others. There we taught and learned mathematics and technology completely free of institutions, for no money, as a non-profit, and for no ulterior motive. It was precisely because it was free as in speech and beer that we had community members, including children and their school teachers, many of them people of colour, turning up for our “Today We Learned” sessions.

It was at Sudo Room that I encountered radical activists who turned me on to practicing my allyship seriously. I lapped up all the geek-feminism wiki anyone could link me to. I donated to the Ada Initiative, and took as much of its training as it would allow a cis male to participate in. Now I refuse to be on all-male panels, and admonish sexist remarks online and in person with best practices. I know there’s no special reward for doing the minimum things to not be sexist – that was a further humbling realization. The uncomfortable epiphany cemented my dedication to this sort of DIY social uplift, and meant my being treasurer, the one person backstopping financial responsibility for a not-yet 501c3 nonprofit. I’m so serious about making an equitable society I laid down my own bankruptcy.

For all its merits, after two years pursuing career hackerdom I feel like I’ve hit a wall, which I want graduate study to address. I believe that in principle in today’s connected world one can fulfil their career without an institution – that would be a path where those coming from less educated backgrounds could flourish. But that is the exception; the theory breaks down with factors like closed access papers, and the need to be propelled by serious, high quality collaborators. So as much as I might disagree with the concept in the abstract – because of its bias against women and racial minorities, etc. – I do want and need something from the institution. This reluctant acceptance is what has caused me to focus my interest in computational social science research directly on unequal representation on the web and by proxy in institutions. There is no need for me to translate my work and history into how it will aid representation in institutions because my very work and history is about unequal representation in institutions.

’s Curatorial Lesson: Constraining Users Less Makes Them More Collaborative

I interviewed with this past week because they want to be more ‘Wiki’, and I am looking for a way to fund my Wiki-based research. After a few videochats, it didn’t quite work out between us, as they are not set up for housing pure research just yet. But some puzzling results around user-trust came out of the interview process.

I did not want to take their standard programming test because under good advice you never should. Instead I suggested that as a work-trial I run and report the collaborativeness measures I developed this year (accepted to CSCW ’15) on their data. Genius accepted this idea, and you can read the full report below or jump directly to it at the NBviewer.

The most interesting thing that comes out of the report is that their subdomain – their subdomain for miscellaneous texts that aren’t rap or news or history etc. – performs very well under the collaborativeness measure in contrast to all other subdomains. As I wrote to them:

X doing very well is a surprising result because one would imagine that the “no specific subject nature” might make it more jungle-like and thus less collaborative. The result is actually quite well explained by considering the humorous but true “Zeroeth law of Wikipedia: Wikipedia only works in practice; in theory it can never work.” Counterintuitively, people collaborate better with fewer constraints rather than more. As people are given more freedoms online they respond well due to unrealized incentives. From a Wikipedian’s perspective this makes a lot of sense: a company can never make decisions for the community as well as the community can.

This, I relayed, was footing for Genius to do away with curating subdomains top-down and fold everything into one large category. Or, I added, if they were really attached to their tradition, perhaps a wikia-style create-your-own-subdomain philosophy could work as well.

It’s astonishing how poorly people understand commons-based peer production outside of the Open Culture world, and cling to the world of the Yahoo! portal homepage. The positive lesson for me is that this was a domain that showed that more user freedom means higher collaborativeness – an intriguing piece of evidence to file away.

Should I Do My PhD In The Open?

“Whenever a work’s structure is intentionally one of its own themes, another of its themes is art.” ~Annie Dillard

My axe became stuck attempting to split this wood by myself.


It was a warm afternoon in Paradies – the park in Jena, Thuringia – exuberant children were pretending to be snakes and crocodiles, and I was attempting to understand what I wanted to pretend to be. My current thought was that a PhD in the computer science / information science realm with a focus on Free Culture was a path forwards, as I explained to my mentor Daniel Mietchen. Neither persuaded nor unconvinced, he socratically proposed, like the Free Open Culture advocate that he is, to open the problem up. That is, he suggested that I should do an “Open PhD”. Its first component, he said, should be a blog post entitled “Should I Do My PhD in the Open?”, which would serve as a basic argument that I could come back to in the eventuality of a PhD-induced depression.

Let’s start with the fundamental question of motivation. Since I let this post sit in draft-phase too long, and started my applications before finishing it, I can no longer provide application-untainted answers. Let us use that fact to design a new avenue of inquiry. To answer the questions of “what?” and “why?” we can analyse my responses to the basic components of a typical PhD application: Statement of Purpose, Personal History, CV and Letters of Recommendation. To answer the question of “how?”, I will take the advice of Annie Dillard’s quote, and use an open git repository to express my opinions about and share my applications.

Research Agenda – What I Aim To Do

I am pleased with the research proposal that I’ve laid out in my Statement of Purpose. It’s sufficiently grand and challenging, and the start of the path lies right where my past has delivered me. I lay out:

“[…] a research agenda to classify the already-known social biases as they appear in the socio-technical fora, and then to search for unidentified phenomena using those classifications. As an explanatory example, create a statistical model of how the known skewed distributions of gender, race, and nationality exist in Wikidata, and then inspect all the property distributions for properties that match the biased patterns. The project grows more complex by allowing property-pairs (e.g. gender by race), different socio-technical communities (e.g. Freebase, OpenStreetMap), and different models of bias (e.g. editorship-measures). If successful we may find overlooked stigmas and phobias in society, all the while building a massive Open Dataset of comparative indexes of known social biases.”

What I particularly like about this proposal is that it drives at the limits of our knowledge – the term I use for it is “Rumsfeldian Unknown Unknowns” (not that I ever had an opinion about his politics but it’s a stand-up phrase). The problem excites me because it’s in the vein of “what is reality?” but lays out a plan to answer that more scientifically than philosophically. It’s also got enough holes – in that I need to learn more about machine learning, and enough strong points – in that I’ve already been parsing and analysing these current data sets. The fact that the question touches and expands on the currently hotly discussed gender gap is an added bonus for application-appeal, but is only coincidental because I really had been thinking about the bias problem in abstract form before I knew about the gender-gap.

Personal Ambition – Why I Aim To Do

The personal statement – which at the time of this writing remains a few scribbled notes – should, when finished, attest to two strands: mind expansion, and a battle with self-discipline.

I wept, the first time I witnessed in clear terms my internalized racism. Well actually I ran away from a protest when I became cognizant that I was listening to the black and female speakers less than the others, and then later I cried. In the underground computer lab, pouring my thoughts into a text editor, I clarified that my thoughts were a product of unquestioned norms, but verifiably wrong nonetheless.

The other strand is about materializing an internal cattle-prod to make my productivity commensurate with whatever natural sharpness I have and however much middle-class privilege it might have derived from. It took some time to start trusting the feeling that I was as dependent on a master as a dog, and to stop trying to please schoolteachers (even if they were actually my bosses). But now I’ve quit my job to live frugally on open source contracts, and research for autodidactic pleasure.

Unless you take extreme measures to avoid being involved with any institutions you’ll always need a CV. In creating mine so far I seem to over-mention “Wikipedia.” It’s not a terrible thing, but a reminder to focus the idea on ‘socio-technical’ systems in general and not on any one of them in specific. This is as true for a reframing of my thoughts as much as it is a single document.

In seeing what I’m focusing on in my personal history, I have to remember the importance of using my privilege correctly. Typically the notion of a personal history is to talk about the overcoming of disadvantages. Being middle-income, male, white, tall, relatively straight, and a native-English speaker means prejudice is lower in my life. The real disadvantage that comes with being so advantaged is how easy it is not to admit or check your privilege. It’s hard to champion change in the world when you’d be least benefited by it – but that is the only interesting course of history from here. This is why what I want to do is hold a statistical mirror to the internet, and maybe that will awaken a few disbelievers. (Get ‘em where it hurts – right in the logic, I know their weakness.)

Lesson Needed – Who I Aim To Impress

With my applications I am clearly trying to impress someone or something, and a more pointed question is: what do I want from them? Most of what I think a program would give me is more education, more professional experience and career preparation, and the pressure-shelter to focus on answering my research agenda. Ultimately then I want things out of myself.

On the education point, there are some clear fields of study to improve. Studying social science is to put a finer tipped question onto the fumbly one I am asking now about bias. Studying machine learning and statistics is aimed at automating the pattern recognition tasks that I’ve proposed in my research agenda. I anticipate not having much of a problem with my mathematics background in learning these things. In fact I secretly look forward to the endorphin rush of expanding my repertoire. I only partly fear my personality rejecting the classroom setting and its hierarchy.

In terms of career preparedness the main thing I am trying to avoid is having a boss, or more accurately, avoiding needing a boss. That is not supposed to mean that I would like to learn the entrepreneurial art of “being your own boss,” since you would still have one at that point, just you would be it. If I could meet, be advised, teach, read, and work all in the direction of pursuing a goal that I’ve freely made for myself by myself then I would have met the goal that I’m making for myself right now.

To use the purpose of a doctoral study program to answer a long standing personal question would be a new height of self-fulfilment. It’s only a useful coincidence that other people will certainly benefit from the results along the way. However if I were doing it for the side-effect then I may have taken a wrong turn. But how will I know if I’ve lost my way? This is where the point of openness reigns.

The Point of Openness – How I Aim To Do

When I see the title “Should I Do My PhD In The Open?” I already see a misplaced emphasis. The question rhetorically centres around openness – that is, whether to apply the principles and ethos of Open and Libre culture to my academic pursuits. That aspect however is the most trivial. “Yes” is the answer.

I want to expand the idea of Open Notebook Science one step further. Imagine a whole PhD as an Open Notebook project. Even its pre-formulations should go online ‘as it happens’. Today, the emotions surrounding, and meta-questions of, my PhD are online before I’ve even applied.

The point of Linus’ law is validity through mass scrutiny. The point of mass scrutiny is the topic of my thesis. And the topic of my thesis is the point of the important questions in my life.

Please follow me as I follow this line of logic for the next few years.

“High Speed Rail: The Board Game” Review: The Nonviolent Utopian Interactive Senate Simulation

Board games do not attract me by default. The closest I have come to enjoying board games is playing Risk, single player on my phone, and reading a lot about Nomic. But when Alfred Twu speaks, I listen, so when he announced his own fresh take on board gaming my ears pricked up and my credit card edged ever so slightly out of my pocket. High Speed Rail: The Board Game emerged from High Speed Rail: The Map, of Guardian fame. Neither the map nor the game is rooted in any material claim to reality, but most things that inflame your imagination aren’t.

High Speed Rail: The Map


The $15 printed-on-overhead-transparency version arrived at my house a few days ago (versions go up to $100 with glass pieces). And last night 6 of us maxed out the player count and sat down to squabble.

Foxy Pigeon Cottage Navigating the Fraught Ways of High Speed Rail Voters

Except, there was rather little squabbling, and that completely changed my mind about what board games can be. The key thing to know about HSR is that you can’t move.

The situation is that you are given 3 objectives, of the type “Logistics Industry: Connect 2 of the 3 Cities – Detroit, Houston, San Diego”. Then you have to place tiles to realize your rail network – except as I mentioned – you’re not allowed to place tiles. Rather when it’s your turn, everybody else will place a tile as a proposal, and you will select another player’s proposal (or 2 proposals at a time for a 5+ player game).

This means that all your strategy comes by way of surveying other players’ objectives and finding the most mutual ground with them. This is slightly complicated by the fact that only 2 of the objective cards are publicly shown, and the 3rd one is kept secret. Barring this small niggle the game is not competitive at all; you are constantly trying to find compromising ground – in some ways a very realistic, and in others a very unrealistic version of democracy in the US.

Unlike most games, I didn’t feel frustrated for having lost to unfair rules, or scheming opponents. It wasn’t too long, only about 45 minutes (and it was everyone’s first time playing). To play is to softly massage an unfolding system. The game has CC-BY-SA logos on all the pieces, and everyone is mostly cooperating. High Speed Rail is the nobly amateur, egalitarian pastime that would be played in Ecotopia.

You can download a free print-it-yourself version here.

Häskell und Grepl: Data Hacking Wikimedia Projects Exampled With The Open Access Signalling Project

In what could easily be a recurring annual trip, Matt Senate and I came to Berlin this week to participate in Open Knowledge Festival. We spoke at csv,conf, a fringe event in its first year, ostensibly about comma-separated values, but more so about unusual data hacking. On behalf of the WikiProject Open Access – Signalling OA-ness team, we generalized our experience in data-munging with Wikimedia projects for the new user. We were asked to make the talk more story-oriented than technical; and since we were in Germany, we decided to use that famous narrative of Häskell and Grepl. In broad strokes we go through: how Wikimedia projects work, the history of Wiki data-hacking, from “Ignore All Rules” to calcification, Wikidata told as Hänsel and Gretel, signalling OA-ness, and how you could do it too.

These are the full slides (although slide show does not seem to like our Open Office document so much):

And a crowdsourced recording of the session:

We missed half of lunch with the queue of questions extending past our session; it was fabulous to see such interest. There is a particular affinity we found with the Content Mine initiative, which wants to programmatically extract facts from papers. Since we are finding and uploading mine-able papers, you could imagine some sort of suggestion system which says to an editor “you cited [fact x] from this paper, do you also want to cite [extracted facts] in the Wikipedia article too?”. Let’s work to make that system a fact in itself.

Wiki-Class Set-up Guide and Exploration

Best viewed with IPython Notebook Viewer



Wiki-Class is a Python package that can determine the quality of a Wikipedia page, using machine learning. It is the open-sourcing of the Random Forest algorithm used by SuggestBot. SuggestBot is an opt-in recommender for Wikipedia editors, offering pages that need work which look like pages they’ve worked on before. Similarly, with this package, you get a function that accepts a string of wikitext, and returns a Wikipedia class (‘Stub’, ‘C-Class’, ‘Featured Article’, etc.). Wiki-Class is currently in alpha according to its packager and developer @halfak, and although I had to make a few patches to get some examples to work, it’s ready to start classifying your wikitext.


  1. Setting it up on Ubuntu.
  2. Testing the batteries-included model.
  3. Using the output by introducing a closeness measure.
  4. Testing making our own model.


At first you may be frustrated to learn that Wiki-Class is Python 3 only. You’ll not be able to mix it with pywikibot, which is Python 2.7 only, and that can also mean upgrading some of your other tools. However just try to recall these update gripes next time you encounter a UnicodeError in Python 2.x; and then be thankful to Halfak for making us give Python 3 a try. I outline getting the environment running in Ubuntu 14.04 here.

Firstly, if you want to use the IPython notebook with Python 3 you can do so with apt-get. And while we’re at it, for convenience we’ll also install another version of pip for Python 3.

In [95]:
!sudo apt-get install ipython3-notebook python3-pip
[sudo] password for notconfusing: 

Some requirements of Wiki-Class, including sklearn and nltk, are a pain with Python 3 since they haven’t been properly packaged for it yet. So these you’ll have to get from source:

In [1]:
!pip3 install git+
!pip3 install git+

Making some random pages for a test dataset

We’ll need to get some wikitext, with associated classifications, to start testing. I elected to make a random dataset in pywikibot, which as already stated is Python 2.7 only and thus needs to be in a separate notebook; you can still view it on the nbviewer. Its output is a file test_class_data.json (github link of the bzip), which is just a dictionary associating qualities and page-texts.
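As a rough sketch of the shape of that file, assuming (as the later cells suggest) that each key is a quality class and each value is a list of page texts. The sample texts below are invented for illustration and are not taken from the real dataset.

```python
import json

# Toy stand-in for test_class_data.json: quality class -> list of wikitexts.
# These texts are made up; the real file holds the pywikibot output.
sample = {
    "Stub": ["'''Foo''' is a village.", "'''Bar''' is a river."],
    "FA": ["'''Baz''' is a long, well-referenced article."],
    "List": ["* item one\n* item two"],
}

# Round-trip through JSON, the same way the file on disk is read with json.load.
classed_items = json.loads(json.dumps(sample))

# Count pages per class and in total.
counts = {quality: len(texts) for quality, texts in classed_items.items()}
total = sum(counts.values())
print(counts)
print(total)
```

Counting per class first is a cheap way to see how unbalanced a random sample is before judging any accuracy figure.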

Warning: this dataset has some examples that can cause a ZeroDivisionError, because some of these pages have 0 non-mark-up text. I wrote this patch, which fixes the issue.

Testing the Pre-built Model

In [3]:
import json
import pandas as pd
from wikiclass.models import RFTextModel
/usr/local/lib/python3.4/dist-packages/pandas/io/ UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.
  .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))

Each model is stored in a .model file. A default one is included in the github repo.

In []:
In [35]:
!mv enwiki.rf_text.model\?raw\=true enwiki.rf_text.model

Now we load the model.

In [4]:
model = RFTextModel.from_file(open("enwiki.rf_text.model",'rb'))
In [5]:
classed_items = json.load(open('test_class_data.json','r'))
print(sum([len(l) for l in classed_items.values()]))

The Wiki-Class-provided model only deals with ‘Stub’, ‘Start’, ‘B’, ‘C’, ‘Good Article’, and ‘Featured Article’ classifications. It does not include ‘List’, ‘Featured List’, or ‘Disambig’ class pages. So we have to sort the standard classes out of our 38,000 test articles.

In [6]:
standards = {actual: text for actual, text in classed_items.items() if actual in ['Stub', 'Start', 'C', 'B', 'GA', 'FA'] }
In [5]:
print(sum([len(l) for l in standards.values()]))

Now we iterate over our 36,000 standard-class pages, and put their Wiki-Class assessments into a DataFrame.

In [6]:
accuracy_df = pd.DataFrame(index=classed_items.keys(), columns=['actual','correct', 'model_prob', 'actual_prob'])
for actual, text_list in standards.items():
    # standards holds only the six classes the model knows, so every comparison is fair
    for text in text_list:
        try:
            assessment, probabilities = model.classify(text)
        except ZeroDivisionError:
            # pages with no non-mark-up text cannot be classified; skip them
            #print(actual, text)
            continue
        # DataFrame.append works in the pandas of this era; it was removed in pandas 2.0
        accuracy_df = accuracy_df.append({'actual': actual,
                                          'correct': int(assessment == actual),
                                          'model_prob': probabilities[assessment],
                                          'actual_prob': probabilities[actual]}, ignore_index=True)

What you see here is that the output of an assessment is really two things: firstly the ‘assessment’, which is simply the ‘class’ which the algorithm predicts best, and secondly a dictionary of probabilities of how likely the text is to belong to each class.

In our DataFrame we record four values: the ‘actual’ class as Wikipedia classes it; whether the actual class matches the model prediction; the probability (read: “confidence”) of the model prediction; and lastly the probability of the actual class. Note that in the “correct” case model_prob and actual_prob are the same.
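To make the bookkeeping concrete, here is a minimal sketch of how one row is derived from a single classification. The probabilities are hypothetical, the helper name `record` is mine, and taking the most probable class as the assessment is an assumption about how the model picks its answer.

```python
def record(actual, assessment, probabilities):
    """Build one row of the accuracy table from a single classification."""
    return {
        "actual": actual,
        "correct": int(assessment == actual),
        "model_prob": probabilities[assessment],
        "actual_prob": probabilities[actual],
    }

# Hypothetical output for one page whose Wikipedia class is 'Start'.
probabilities = {"Stub": 0.1, "Start": 0.3, "C": 0.4, "B": 0.2, "GA": 0.0, "FA": 0.0}

# Assumption: the assessment is the class with the highest probability.
assessment = max(probabilities, key=probabilities.get)

row = record("Start", assessment, probabilities)
print(row)
```

Here the model is wrong but not by much: it gives ‘C’ 0.4 and the actual class ‘Start’ 0.3, which is exactly the kind of near miss that a plain correct/incorrect flag hides.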

In [7]:
df = accuracy_df.dropna(how='all')
df.head()
   actual  correct  model_prob  actual_prob
18  Start        0         0.4          0.0
19  Start        1         0.8          0.8
20  Start        0         0.4          0.0
21  Start        0         1.0          0.0
22  Start        1         0.7          0.7

If we look at the mean of ‘correct’ for each class, we should hopefully see something above 1/6, which is what guessing uniformly among the six classes would score. And we do.

In [8]:
groups = df.groupby(by='actual')
groups['correct'].mean()
B         0.247391
C         0.278138
FA        0.854167
GA        0.444444
Start     0.387334
Stub      0.698394
Name: correct, dtype: float64

See how “close” predictions are if they are not correct.

Now we hack on the output. The Random Forest is really just binning text into different classes; it doesn’t know that some of the classes are closer to each other than others. Therefore we define a distance metric on the standard Wiki classes. I call this order the “Classic Order”. To get an intuition, consider this example. If an article is a Good Article and the model prediction is also Good Article, then it is off by 0; if the model prediction is Featured Article, it is off by 1; if the model prediction is Start, then it is off by 3.
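The worked example above can be checked with a small standalone sketch. The names `RANK` and `off_by` are made up for illustration; the class order is the one used in the rest of this notebook.

```python
CLASSIC_ORDER = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
# Map each class name to its position along the quality scale.
RANK = {name: position for position, name in enumerate(CLASSIC_ORDER)}


def off_by(actual, predicted):
    """Number of steps between two classes along the Classic Order."""
    return abs(RANK[predicted] - RANK[actual])


# The Good Article example from the text: prints 0, then 1, then 3.
for predicted in ['GA', 'FA', 'Start']:
    print('GA', predicted, off_by('GA', predicted))
```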

In [7]:
classic_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
enum_classic = enumerate(classic_order)

for enum, classic in dict(enum_classic).items():
    print(enum, classic)
0 Stub
1 Start
2 C
3 B
4 GA
5 FA

Now we are going to iterate over the same dataset as above, but instead of recording “correctness”, we record the closeness in a DataFrame.

In [8]:
classic_order = ['Stub', 'Start', 'C', 'B', 'GA', 'FA']
classic_dict = dict(zip(classic_order, range(len(classic_order))))

off_by_df = pd.DataFrame(index=classed_items.keys(), columns=['actual','off_by'])

for classic in classic_order:
    for text in standards[classic]:
        try:
            assessment, probabilities = model.classify(text)
        except ZeroDivisionError:
            # skip texts the model cannot score
            #print(classic, text)
            continue
        off_by_df = off_by_df.append({'actual': classic,
                                      'off_by': abs(classic_dict[assessment] - classic_dict[classic])}, ignore_index=True)

So it should look something like this as a table

In [9]:
off_by = off_by_df.dropna(how='all')
off_by.head()
   actual  off_by
18   Stub       2
19   Stub       1
20   Stub       0
21   Stub       0
22   Stub       0

And as a chart.

In [10]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib

WARNING: pylab import has clobbered these variables: ['text']
`%pylab --no-import-all` prevents importing * from pylab and numpy

We can see that the middle classes are harder to predict, whereas the ends are easier. This corroborates our expectations. Since the quality spectrum bleeds past these rather arbitrary cut-off points, more of the quality spectrum lies in the end intervals, and so it’s easier to bin them.

In [11]:
ax = off_by.groupby(by='actual',sort=False).mean().plot(title='Prediction Closeness by Quality Class', kind='bar', legend=False)
ax.set_ylabel('''Prediction Closeness (lower is more accurate)''')
ax.set_xlabel('''Quality Class''')
<matplotlib.text.Text at 0x7fc089810550>

Making a model

Now we test the model-making feature. We will use our dataset of ‘standards’ from above, using a random 80% for training and 20% for testing.

In [27]:
from wikiclass.models import RFTextModel
from wikiclass import assessments

Divvying up our data into two lists.

In [28]:
import random

train_set = list()
test_set = list()
for actual, text_list in standards.items():
    for text in text_list:
        # randint(0, 9) is inclusive, so 8 and 9 send roughly 20% to the test set
        if random.randint(0,9) >= 8:
            test_set.append( (text, actual) )
        else:
            train_set.append( (text, actual) )


And the next step is quite simple: we just click a button, supplying our train_set list, and test by supplying our test_set list. The package also conveniently supplies a saving function so we can store our model for later use.

In [29]:
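# Train a model
model = RFTextModel.train(
    train_set,
    # the rest of this call is missing from the post; the WP10 scale is an assumption
    assessments=assessments.WP10
)

# Run the test set & print the results
results = model.test(test_set)
print(results)

# Write the model to disk for reuse.
model.to_file(open("36K_random_enwiki.rf_text.model", "wb"))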
pred assessment    B    C  FA  GA  Start  Stub
real assessment                               
B                130   29   1   5    105    40
C                 34  112   0   2    151    33
FA                 7    3   4   0      1     0
GA                 8    8   0  11      9     1
Start             80   87   0   2   1420   525
Stub              40   32   0   0    547  3973

Now to look at accuracy, we normalize the DataFrame row-wise, so that each row (one real class) sums to 1.

In [30]:
norm_results = results.apply(lambda row: row / row.sum(), axis=1)
norm_results
pred assessment         B         C        FA        GA     Start      Stub
real assessment
B                0.419355  0.093548  0.003226  0.016129  0.338710  0.129032
C                0.102410  0.337349  0.000000  0.006024  0.454819  0.099398
FA               0.466667  0.200000  0.266667  0.000000  0.066667  0.000000
GA               0.216216  0.216216  0.000000  0.297297  0.243243  0.027027
Start            0.037843  0.041154  0.000000  0.000946  0.671712  0.248344
Stub             0.008711  0.006969  0.000000  0.000000  0.119120  0.865200

And finally we can view the performance by class, which intriguingly is better than what we got with the batteries-included model for Stub, Start, C, and B, though worse for GA and FA.

In [35]:
for c in classic_order:
    print(c, norm_results.loc[c][c])
Stub 0.865200348432
Start 0.671712393567
C 0.33734939759
B 0.41935483871
GA 0.297297297297
FA 0.266666666667

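We can see that having a large number of stubs to train on really gives us high accuracy in classifying them (strictly speaking this is recall, since each row is normalized by the real class). GA and FA, with only 37 and 15 articles in the test set, fare worst.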
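Since the word “precision” has a specific meaning, here is a rough sketch that computes both recall (hits divided by the row total) and precision (hits divided by the column total) from the confusion matrix printed above. It is plain Python rather than pandas, and the variable names are my own.

```python
classes = ['B', 'C', 'FA', 'GA', 'Start', 'Stub']
# Rows are the real assessment, columns the predicted one (same order as `classes`).
confusion = [
    [130,  29, 1,  5,  105,   40],   # B
    [ 34, 112, 0,  2,  151,   33],   # C
    [  7,   3, 4,  0,    1,    0],   # FA
    [  8,   8, 0, 11,    9,    1],   # GA
    [ 80,  87, 0,  2, 1420,  525],   # Start
    [ 40,  32, 0,  0,  547, 3973],   # Stub
]

recall = {}
precision = {}
for i, name in enumerate(classes):
    hits = confusion[i][i]
    row_total = sum(confusion[i])                 # articles really in this class
    col_total = sum(row[i] for row in confusion)  # articles predicted as this class
    recall[name] = hits / row_total
    precision[name] = hits / col_total if col_total else float('nan')

for name in classes:
    print(f'{name:5s} recall={recall[name]:.3f} precision={precision[name]:.3f}')
```

The recall column reproduces the per-class numbers listed above. Precision answers the other question: of the articles the model labels with a class, how many really belong to it.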

So there you have it – a brief playing around with Wiki-Class, an easy way to get rough quality estimates out of your data. If you extend any more examples of using this class, I’d be intrigued to see and collaborate on them.



What Part of “School” Don’t You Understand?

I received an apologetic email from HackerSchool an hour ago, sorry to tell me they couldn’t admit me this fall. Curiously, I was not gutted. HackerSchool is part of the wave of “Hacker Education,” where you exchange something with a company for programming education. HackerSchool differs in that you don’t pay them upfront, or necessarily at all; they just want a cut of a potential recruiting bonus when they pawn you off to another company. They also have good perspectives on lightweight social rules and gender equality, which piqued my interest. Still, let us not mince words: this is private education. A more dedicated Hacker might call it a co-option of DIY, gift-economy culture.

Although this might seem a bitter and fruitless retaliation in response to a rejection letter, that is only the first lily-pad. In fact, there is something more that made HackerSchool particularly attractive to me, which only became apparent in retrospect. As I wrote in my application (which is copied in its entirety below), and previously about dogs, a central conundrum for me is not knowing how to work for myself. I can work to impress authorities, appear clever for narcissistic purposes, or for fear of failure – but not because I want to.

HackerSchool’s “everyone determines their own lesson plan” philosophy could be a vital stepping stone to a dreamed-of autodidacticism. On the face of it, going to HackerSchool even looks like genuine autodidacticism. But closer inspection would reveal that you have an authority (the HackerSchool institution) that instructs you to teach yourself. The outer loop, the most meta-level, is still a deference to a force that isn’t your own. It’s virtualizing self-ownership, which is really just “bluepilling” yourself.



This conflict arose during my 14-minute interview with the organizers, in the question of how I would learn my favoured topics. “I suppose I would use textbooks, as in college,” I replied without confidence, and later amended to “but typically it’s been project-needs plus stackoverflow.” In both cases I now see I pointed to historic examples where the main motivator was either a professor or a boss. Unsurprisingly, this answer satisfied neither me nor an interviewer not looking to be my professor or boss.

Looking for positive cases of escaping this servitude, there is obviously one classical logic: accomplishment-desire drives non-authority-pleasing mechanisms of work. But what if we allow that natural curiosity is not the only way to exit the teach-yourself-to-teach-yourself paradox? What could the alternatives be? We could use as a starting point the goal vs. process attitude dichotomy. In this framework the ravenous prodigy sits neatly on the “goal” side. And on the other side?

There isn’t a prominent model to represent the unspurred, successful process-worshipper. The best examples I can offer are probably something like Aaron Swartz, Grigori Perelman, or a stereotypical monk. Such a dearth of role models probably exists because process-oriented people aren’t highly lauded in our prize-counting society, and are thus non-notable. This is a dead end I feel I’ve been running into frequently.

The conclusive feeling here is not directed, but is still a redoubling of effort. It’s a large, and still partially free internet out there. There’s Open Access research to read and write, and Open Source code to execute and develop. Even without the promise of coming to an epiphany of how not to get depressed about the fact that I do it alone in my bedroom, that unlit corridor still calls to me as the one with light at the end.

Should it help anyone fulfil their dreams, and for the sake of radical transparency, this was my HackerSchool Application.

HackerSchool Application


Please include any that you have: GitHub, LinkedIn, personal web site, etc. Any tips for updating? (trying to fix width issue on category subpages may need to switch thee)

Code CracklePop

Write a program that prints out the numbers 1 to 100 (inclusive). If the number is divisible by 3, print Crackle instead of the number. If it’s divisible by 5, print Pop. If it’s divisible by both 3 and 5, print CracklePop. You can use any language.


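rice = [3,5]
crispies = ['Crackle','Pop']
rice_crispies = dict(zip(rice, crispies))

# range(1, 101) covers 1 to 100 inclusive
for i in range(1, 101):
    printed = False
    for flake in rice:
        if i % flake == 0:
            print(rice_crispies[flake], sep='', end='')
            printed = True
    # fall back to the number itself when neither 3 nor 5 divides it
    if not printed:
        print(i, end='')
    print('', end='\n')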


Please link to a program you’ve written from scratch.

You can use something like GitHub’s gist to host your code. It doesn’t need to be long, but it should be something you’ve written yourself, and not using a framework (e.g., Rails). If you don’t have anything to submit, code something small, like a game of tic-tac-toe.


This tutorial translates an economics algorithm into Python. In short, it does some matrix calculations, statistical analysis, and some plotting. Its most advanced language feature is a Python “generator.”

What is the most fascinating thing you’ve learned in the past month?

This doesn’t have to be about programming.


We all know an averaged crowd of fair-goers can guess the weight of a heifer more accurately than any of the individual simpletons among them. Science shows the principle extends to marbles and encyclopedias as well. But what I picked up this month at the Network Science Conference ‘14 was that it can be applied to stock trading too. Diversification strategies work, but they can also be diversified. A team of researchers explained a technique that they simulated. If you followed x traders, mimicking exactly the trades they perform, but with 1/x of your money each, then for sufficiently high x, the return is higher than that of any of the individuals.


The network science bit comes in because you don’t want anyone you follow to be following each other. For the highest return on investment, those who you follow should have “no common ancestors,” in network parlance.


More so than stock trading, the “wisdom of the crowds” theory appeals to me. Trying to make clever stock decisions is a huge industry, and this intuitive simple mechanism can compete with more complex ones. What’s fascinating here to me is how theories can unexpectedly translate between domains.

What do you want to be doing in two years?


Two years from now I would like to be swimming through the gooey centre of a large research project at a think tank or in research and development. Stemming from my previous employment at OCLC Research (a library think tank), I enjoy the freedom of blue-sky thinking. Therefore the employers that have a large enough budget for pure research (Microsoft Research is a good example of this) are the competitive waters that I want to compete in. Having such lofty dreams is never regrettable in my experience, because there are always failsafes. In this case one can always sell oneself as a Data Analyst for business intelligence.


To enjoy any future work, however, it would be crucial for me to be in a team of stellar collaborators. My personal adage (which I stole from a guy that works in a copy shop) is “life is the water cooler, the water cooler is life.” Being around people ignites my mind (even at the copy shop), and I want to continue fuelling that fire. I will continue to invite uncomfortable differences in perspective. Therefore in two years I want to be in a team that values learning over goals. Goals inevitably follow learning, but not vice versa.

Why do you want to do Hacker School?

I see Hacker School as the centre of a Venn diagram of my desires, which are (1) learning self-directedly, (2) being part of a supporting group, and (3) boosting employment opportunity after.


Last year when I went to my boss and asked her to crack the whip on me harder, my own actions perplexed me. I quit my job to attack the problem of relying on authority to motivate my work. But next came the paradox: how can one self-direct one’s self to autodidactically become self-directed? Recursion without a base case.


From a pragmatic perspective I still go to my local hackerspace because I enjoy what could be termed “co-learning.” The social environment drives me. Being conscious of your impression on others can psychologically push you to work. It’s not self-actualizing alone in your bedroom, but it’s effective.


Hacker School seems like a realpolitik compromise between bootstrapping self-ownership and well-proven social dynamics. Given that Hacker School can also help with the personally dreaded task of a job search after, I see a trifecta being won.


What would you like to work on at Hacker School?

E.g., things you want to learn or understand better, projects you want to build or contribute to, etc.


While there are a few pet projects that jump to mind, none are as important as the process of the work I might do. Rather than pronounce any work in detail, I would describe my desires declaratively. There are two main criteria. Firstly, like a carrot just beyond the horse’s reach, I want to find a project that is harder than I expect, simply to level up. Secondly, I want to overcome the folly of the lone-inventor myth. To work horizontally with a partner is as important as being the bringer of the techno-revolution. So the thing I would like to work on is a new idea I would receive while I’m at Hacker School.


That being said, in the absence of any external input, a few of the topics I want to understand better are machine learning, genetic algorithms, and pattern recognition. These corners of computer science are somehow just cool. Pursuing them, since they are substantially complex, seems commensurate with the Hacker School motto of “get dramatically better.”


Also I want to be able to make my phone turn off silent mode by sending it a secret text for those hidden-under-couch situations.

Programming background

This information will not disqualify your application. We use it to better get to know our applicants and where they currently are. If you’re worried that you won’t fit into Hacker School, you can read about some of our alumni.

Describe your programming background in a few sentences.

2006. Failed Java in Community College.

2009. Discover I enjoyed programming Turing Machines on paper in “Computability Theory” in my Pure Math major.

2010. Enroll in – and revel in – the purity of the Berkeley/MIT Scheme tradition.

2011. Fail Java again. Tech career funeral and wake.

2012. Phoenixed with Python + Stackoverflow, to write Wikipedia bots.

2013. Welcome to the FOSS movement. Linux and git start unpacking in my brain.

2014. Hacker School


Have you worked professionally as a programmer?

If so, please describe your experience.


Working in programming and working hard at reinventing the idea of a “professional” programmer have been the last three years of my life. When I was “Wikipedian-in-Residence” I turned my job into programming by convincing management of my proposed bot-writing projects. In my own business I’ve won contracts to deliver reports that were the result of custom programs. So although I’ve never worked as a typical professional programmer, I like my life to be about delivering code for pay.


Do you have a Computer Science degree or are you seeking one?


I have a Bachelor’s degree in Mathematics from the University of California, Berkeley, and have applied and furthered my Computer Science knowledge outside of academia. In the far future I have considered enrolling in a Master’s or PhD program. My draw towards a heavily mathematical emphasis looms. From my work with Wikipedia, a more human and social element has nestled in my head. Therefore it’s possible that my interests would converge in a Computer Science degree.

Logicomix, page 162.

Prerequisite-free Set Theory – Just The Intuition


My favourite Hackerspace, Sudo Room, is very close to Bay Area Public School, whose concept of an anti-capitalist University intrigues me very much. In chatting about their plans for Math education, they expounded on the need for a primer to Set Theory, as they had been learning the philosophy of Alain Badiou, who utilizes those foundations. Their request was for a softer, more intuitive introduction. And just a short 18 months after that casual chat, this last Saturday, June 14th 2014, I held that public class, and it went brilliantly. Two very curious minds showed up and we had fun reading the comic example aloud. The comic we used as a launching point is Logicomix: An Epic Search for Truth.


See how the Method of Reflections evolves as a recursive process.

Method of Reflections: Explained and Exampled in Python

The introduction of the post is mirrored here, but the full tutorial is on IPython Notebook Viewer.


The Method of Reflections (MOR) is an algorithm, first coming out of macroeconomics, that ranks nodes in a bipartite network. This notebook should hopefully help you implement the Method of Reflections in Python. To be precise, it is the modified algorithm proposed by Caldarelli et al. (doi:10.1371/journal.pone.0047278), which solves some problems with the original Hidalgo-Hausmann (HH) algorithm (doi:10.1073/pnas.0900943106). The main problem with HH is that all values converge to a single fixed point after sufficiently many iterations. The Caldarelli version solves this by adding a new term to the recursive equation, what they call a biased random walker (function G). I hadn’t seen any open-source implementations of this algorithm, so I thought I’d share my naïve approach.
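For orientation, the original HH iteration can be sketched in a few lines of plain Python. This is not the notebook’s code and not the Caldarelli variant; the function name `method_of_reflections` and the toy matrix are made up for illustration, and the sketch assumes a binary matrix in which every row and every column contains at least one 1. Each step replaces a node’s score with the average of its neighbours’ previous scores, which is why the scores within a node set drift toward one common value.

```python
def method_of_reflections(M, iterations):
    """Original HH iteration on a binary bipartite adjacency matrix M.

    Rows are one node set (e.g. countries), columns the other (e.g. products).
    Returns the row scores and column scores after `iterations` steps.
    """
    n_rows, n_cols = len(M), len(M[0])
    # Step 0: a node's score is its degree.
    row_degree = [sum(row) for row in M]
    col_degree = [sum(M[r][c] for r in range(n_rows)) for c in range(n_cols)]
    k_row, k_col = list(row_degree), list(col_degree)
    for _ in range(iterations):
        # Each new score is the mean of the neighbours' scores from the previous step.
        next_row = [sum(M[r][c] * k_col[c] for c in range(n_cols)) / row_degree[r]
                    for r in range(n_rows)]
        next_col = [sum(M[r][c] * k_row[r] for r in range(n_rows)) / col_degree[c]
                    for c in range(n_cols)]
        k_row, k_col = next_row, next_col
    return k_row, k_col


# Hypothetical 4x4 network; the first row links to every column.
M = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]

# Watch the spread between the highest and lowest row score shrink.
for n in (0, 2, 10, 200):
    k_row, _ = method_of_reflections(M, n)
    print(n, max(k_row) - min(k_row))
```

The shrinking spread is the convergence problem described above: after enough iterations the row scores are nearly indistinguishable, so they no longer rank anything.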

Read on at
