# Kumusha Takes Wiki: Actionable Metrics for Uganda and Côte d’Ivoire

A live version is available on GitHub and IPython nbviewer.

# Investigating Good Articles for Côte d’Ivoire and Uganda

## For Kumusha Takes Wiki – by Max Klein, 10 January 2014

Our principal question in this investigation is what makes a good article about a country, in the hope of improving articles related to Côte d’Ivoire and Uganda.

We will explore several aspects of article quality and how they relate to different language Wikipedias. First we look at the articles that literally describe the countries of Côte d’Ivoire (Ivory Coast) and Uganda and compare them to other country articles in the English, French and Swahili Wikipedias. This serves both to sanity-check our method and to determine which subsequent articles to explore. Here we draw on the research Tell Me More: An Actionable Quality Model for Wikipedia by Warncke-Wang, Cosley, and Riedl, which is meant to measure points of quality that can be addressed more easily. Secondly, we move to sets of articles about a subject of that nation. These articles are chosen as the most important by how often their subject occurs in a specific Wikipedia. Finally we attempt to apply the lessons from Predicting Quality Flaws in User-generated Content: The Case of Wikipedia by Anderka, Stein, and Lipka, which are about finding user-specified quality flaws.

This report is an IPython Notebook, so it contains all the code necessary to reproduce these results. Much of the code is at the end, to maintain non-technical readability.

## Getting the Wikidata item for every country

We want to compare literal country articles worldwide. Since we will be analysing the English, French, and Swahili Wikipedias, we find the Wikidata items related to these countries (here denoted by their identifiers beginning with “Q”). Each Wikidata item points to all the different Wikipedia language versions available. In an attempt to avoid language bias, this list was first queried from the Wikidata property “instance of” with target country (through the offline Wikidata parser). However, this only yielded 133 items. So, even though it is Anglocentric, our list is scraped from the English Wikipedia List of Countries by Population, yielding:

In [5]:
country_qids = json.load(open('country_qids.json'))
ugandaqid, cdiqid= 'Q1036', 'Q1008'
print "We are comparing: ", len(country_qids), 'countries".'
print "Some of their Wikidata QIDs are: ", map(lambda qid: 'http://wikidata.org/wiki/'+qid, country_qids[:5])
We are comparing:  242 countries".
Some of their Wikidata QIDs are:  [u'http://wikidata.org/wiki/Q16', u'http://wikidata.org/wiki/Q33788', u'http://wikidata.org/wiki/Q25230', u'http://wikidata.org/wiki/Q874', u'http://wikidata.org/wiki/Q37']

## Let’s inspect each metric

We look at the 5 statistics that are mentioned in the literature as “Actionable Metrics”. The metrics are:

1. Completeness – A count of the intra-wiki links to other articles.
2. Informativeness – A combination of the ratio of readable text to wikitext markup, plus the number of images.
3. Number of Headings – A count of the sections and subsections.
4. Article Length – The length of the readable article in characters.
5. Reference Rate – The number of references per unit of article length.

Tell Me More applies weights to these metrics. We use the same weights, but for experimentation they could be changed.

In [140]:
def report_actionable_metrics(wikicode, completeness_weight=0.8, infonoise_weight=0.6, images_weight=0.3):
    # weighted sum of the readable-text ratio and the number of file (image) links
    informativeness = (infonoise_weight * infonoise(wikicode)) + (images_weight * num_file_links(wikicode))

    # completeness, numheadings, articlelength and referencerate are computed by lines not shown here
    return {'completeness': completeness, 'informativeness': informativeness, 'numheadings': numheadings, 'articlelength': articlelength, 'referencerate': referencerate}
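
To give a feel for how the five numbers fit together, here is a minimal sketch that approximates them with regular expressions on a raw wikitext string. It is written in Python 3 and is not the notebook's implementation, which parses pages with mwparserfromhell. The function name `sketch_metrics`, the regular expressions, the way completeness is weighted, and the crude "readable text" approximation are all assumptions made for illustration; only the three default weights are taken from the signature above.

```python
import re

FILE_PREFIXES = ("File:", "Image:", "Fichier:", "Picha:")

def sketch_metrics(wikitext, completeness_weight=0.8,
                   infonoise_weight=0.6, images_weight=0.3):
    """Approximate the five actionable metrics (hypothetical helper)."""
    # Targets of all [[...]] links, split into file links and article links.
    links = re.findall(r"\[\[([^\]|]+)", wikitext)
    file_links = [l for l in links if l.startswith(FILE_PREFIXES)]
    article_links = [l for l in links if not l.startswith(FILE_PREFIXES)]

    # Section headings such as "== History ==".
    headings = re.findall(r"^=+\s*(.*?)\s*=+\s*$", wikitext, flags=re.M)

    # Opening <ref> tags (not </ref> and not <references/>).
    refs = re.findall(r"<ref[\s>]", wikitext)

    # Crude readable text: drop references and non-nested templates,
    # keep only the visible part of links, remove heading markers.
    text = re.sub(r"<ref[^>]*/>|<ref[^>]*>.*?</ref>", "", wikitext, flags=re.S)
    text = re.sub(r"\{\{.*?\}\}", "", text, flags=re.S)
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"^=+|=+\s*$", "", text, flags=re.M)
    readable = len(text.strip())

    infonoise = readable / len(wikitext) if wikitext else 0.0
    return {
        "completeness": completeness_weight * len(article_links),
        "informativeness": infonoise_weight * infonoise
                           + images_weight * len(file_links),
        "numheadings": len(headings),
        "articlelength": readable,
        "referencerate": len(refs) / readable if readable else 0.0,
    }

# A made-up snippet of wikitext, purely for demonstration.
sample = (
    "{{Infobox country}}\n"
    "'''Uganda''' is a country in [[East Africa]].<ref>Source A</ref>\n"
    "== History ==\n"
    "[[File:Flag.svg|thumb]] It borders [[Kenya|the Kenyan]] frontier.\n"
)
metrics = sketch_metrics(sample)
```

On the sample, two article links, one file link, one heading and one reference are found, so a weak score on any single key points at a concrete kind of editing task (add links, add images, add sections, expand text, or add references).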

We will now look at each metric one by one, on the set of our Wikidata Q-IDs to get a feel for how these metrics will be used later on.

For each metric, the position of the Uganda and Côte d’Ivoire articles amongst other countries is shown by the z-score, a measure of how many standard deviations they are above or below the average. Additionally we display a graph of the distribution of the country articles' scores, separated by Wikipedia language.
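
As a small worked illustration of the z-score, the Python 3 sketch below standardises a handful of made-up values using only the standard library. The helper name `zscores` and the numbers are hypothetical. It uses the sample standard deviation (dividing by n − 1), which is also the default of the pandas `.std()` call used in the supporting code.

```python
from statistics import mean, stdev

def zscores(values):
    """Map each key to (value - mean) / sample standard deviation."""
    vals = list(values.values())
    mu = mean(vals)
    sigma = stdev(vals)  # sample standard deviation (n - 1 in the denominator)
    return {key: (val - mu) / sigma for key, val in values.items()}

# Made-up article lengths for four hypothetical country items.
lengths = {"A": 10.0, "B": 20.0, "C": 30.0, "D": 40.0}
z = zscores(lengths)
```

Here the mean is 25, so item D ends up a little more than one standard deviation above average and item A the same distance below it.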

In [499]:
metric_analyse_density('informativeness', 12)
Uganda (Q1036), Côte d'Ivoire (Q1008) informativeness z-scores.
Out[499]:
language en fr sw
Q1036 0.381577 -0.643210 -0.272597
Q1008 -0.431621 3.311523 -0.384729

Most notable here, Côte d’Ivoire has a very high score in French Wikipedia – 3.31 standard deviations above average.

In general these distributions are quite standard, with a relatively long right tail (positive skew), which we would expect from user-generated content. English and French also outpace a spikier Swahili, which fits with standard conceptions about the popularity of these languages.

In [500]:
metric_analyse_density('completeness', 1500)
Uganda (Q1036), Côte d'Ivoire (Q1008) completeness z-scores.
Out[500]:
language en fr sw
Q1036 -0.443975 -0.176491 1.736298
Q1008 -0.973200 1.767785 -0.505017

Reassuringly, Uganda scores highly here in Swahili (1.74 standard deviations above average), meaning that its article links heavily to other articles.

In [501]:
metric_analyse_density('numheadings', 75)
Uganda (Q1036), Côte d'Ivoire (Q1008) numheadings z-scores.
Out[501]:
language en fr sw
Q1036 -1.695575 0.028298 0.023235
Q1008 0.660221 -0.411489 -0.427908

Here English appears to have an almost bimodal distribution. We welcome your explanations of why that should be so.

In [505]:
metric_analyse_density('articlelength',150000)
Uganda (Q1036), Côte d'Ivoire (Q1008) articlelength z-scores.
Out[505]:
language en fr sw
Q1036 0.105196 -0.498248 0.190613
Q1008 0.335907 2.764950 -0.244492

Again Côte d’Ivoire scores very high in French Wikipedia, at 2.76. It must be quite a long article.

In [503]:
metric_analyse_density('referencerate', 0.01)
Uganda (Q1036), Côte d'Ivoire (Q1008) referencerate z-scores.
Out[503]:
language en fr sw
Q1036 0.014790 -0.608334 3.129056
Q1008 -1.183966 1.663328 -0.265412

These results lend credibility to this technique, because the Wikipedias in the native languages of those countries show the highest quality. For the Uganda articles, performance is slightly below average in English and French but strong in Swahili. For the Côte d’Ivoire articles, the real stand-out is French Wikipedia, at times 3 standard deviations above average, but again below average in other languages. So we can somewhat confidently apply this technique to other sets of articles to see what improvements could be made. Remember that a weakness in any one of the 5 metrics is an indicator for a specific type of task that could be done on those articles. Next we will inspect how to determine other sets of articles to which we can apply this method.

## Highest Occurring Sections

Now we will find the highest occurring section names in country articles by language. The assumption we use here is that if a section name occurs frequently, it is an important subject in that language.
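
The counting idea can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not the notebook's code: the helper `heading_frequencies`, the heading regex, and the toy articles are hypothetical, and the sketch counts each heading at most once per article and lower-cases it before counting.

```python
import re
from collections import Counter

def heading_frequencies(articles):
    """Return {heading: share of articles containing it} for wikitext strings."""
    counts = Counter()
    for text in articles:
        headings = re.findall(r"^=+\s*(.*?)\s*=+\s*$", text, flags=re.M)
        # Count each heading once per article, case-insensitively.
        counts.update({h.lower() for h in headings})
    return {h: c / len(articles) for h, c in counts.items()}

# Made-up miniature "articles" for one language.
toy_articles = [
    "== History ==\ntext\n== Economy ==\ntext",
    "== History ==\ntext\n== Geography ==\ntext",
    "== history ==\ntext",
    "no headings here",
]
freqs = heading_frequencies(toy_articles)
top = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
```

With these four toy articles, "history" appears in three of them, giving it a frequency of 0.75, the same kind of figure as the `freq` column in the tables that follow.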

In [113]:
top_n_sections('fr',20)
Out[113]:
lang secname freq
1941 fr liens externes 0.904959
218 fr histoire 0.884298
1343 fr géographie 0.826446
878 fr notes et références 0.809917
990 fr économie 0.785124
847 fr démographie 0.752066
1376 fr politique 0.685950
710 fr voir aussi 0.685950
2074 fr articles connexes 0.648760
1556 fr codes 0.644628
1350 fr culture 0.640496
1637 fr bibliographie 0.491736
641 fr subdivisions 0.301653
94 fr climat 0.301653
1030 fr langues 0.289256
800 fr sport 0.243802
602 fr éducation 0.239669
1613 fr religions 0.231405
1485 fr religion 0.219008
1461 fr santé 0.198347
In [114]:
top_n_sections('en',20)
Out[114]:
lang secname freq
3168 en history 0.954545
2688 en economy 0.888430
3218 en demographics 0.876033
2838 en geography 0.822314
3013 en references 0.805785
3356 en culture 0.797521
3492 en religion 0.657025
4049 en education 0.648760
2315 en climate 0.545455
2153 en military 0.462810
3046 en politics 0.438017
2652 en sports 0.396694
2250 en foreign relations 0.384298
3577 en languages 0.367769
3422 en cuisine 0.363636
3086 en health 0.359504
In [506]:
top_n_sections('sw',7)
Out[506]:
lang secname freq
4122 sw {{infobox country 0.510730
4260 sw viungo vya nje 0.484979
4371 sw historia 0.412017
4295 sw jiografia 0.334764
4077 sw tazama pia 0.184549
4339 sw uchumi 0.120172
4245 sw wakazi 0.103004

Looking at the most frequent sections in each language, what patterns emerge? First let us acknowledge that some of the sections here are not content, like “Notes” and “References”. The ‘leader’ section (the section before any named sections) usually starts in Swahili with the code that includes the Infobox Country, which is where that cryptic result comes from. Ignoring non-content sections, though, there is a clear winner in all three languages: History.

In all three languages History is the one section you can most rely on being there, occurring in 88% of French, 95% of English and 41% of Swahili country articles. In Swahili and French the second most popular content section is Geography, but in English it is Economy. This tells us which Wikipedia categories to apply our metrics to next: the History, Economy, and Geography categories.

## Subject and Nation Analysis

Given that we now have the top sections for each language, we devise a method to look at representative articles. We find a Category in each Wikipedia for each subject in our highest occurring section headings and for each major nation associated with our languages. We create a script to present every article in every category and all its recursive subcategories to a human judge. The full list of categories used for the second part of this analysis is:
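
Walking a category and all its recursive subcategories is a graph traversal, and category graphs can contain cycles, so visited categories need to be remembered. The Python 3 sketch below shows the idea on a tiny in-memory graph. The function `collect_articles` and the two dictionaries are hypothetical stand-ins; the actual script fetched category members from the live wikis.

```python
from collections import deque

def collect_articles(root, subcats, members):
    """Breadth-first walk over a category graph, returning all article titles."""
    seen, queue, articles = {root}, deque([root]), set()
    while queue:
        cat = queue.popleft()
        articles.update(members.get(cat, []))
        for child in subcats.get(cat, []):
            if child not in seen:  # guard against cycles between categories
                seen.add(child)
                queue.append(child)
    return articles

# Made-up category graph with a deliberate cycle.
subcats = {
    "Economy of X": ["Companies of X", "Banks of X"],
    "Banks of X": ["Economy of X"],
}
members = {
    "Economy of X": ["Currency of X"],
    "Companies of X": ["Acme X"],
    "Banks of X": ["Bank of X", "Currency of X"],
}
candidates = collect_articles("Economy of X", subcats, members)
```

The resulting set of candidate titles is what would then be shown, one by one, to the human judge for acceptance or rejection.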

In [266]:
categoryjson = json.load(open('ethnosets-categories-capitalized.json','r'))
for subject, nationdict in categoryjson.iteritems():
    print subject
    for nation, langdict in nationdict.iteritems():
        print " |---"  + nation
        for lang in ['fr','en','sw']:
            try:
                print "    |---" + langdict[lang]
            except KeyError:
                # not every nation has a category in every language
                pass
Geography
 |---Côte d'Ivoire
    |---Catégorie:Géographie_de_la_Côte_d'Ivoire
    |---Geography_of_Ivory_Coast
    |---Jamii:Jiografia_ya_Cote_d'Ivoire
 |---USA
    |---Catégorie:Géographie_des_États-Unis
    |---Category:Geography_of_the_United_States
    |---Jamii:Jiografia_ya_Marekani
 |---Uganda
    |---Catégorie:Géographie_de_l'Ouganda
    |---Category:Geography_of_Uganda
    |---Jamii:Jiografia_ya_Uganda
 |---France
    |---Catégorie:Géographie_de_la_France
    |---Category:Geography_of_France
History
 |---Côte d'Ivoire
    |---Catégorie:Histoire_de_la_Côte_d'Ivoire
    |---Category:History_of_Ivory_Coast
 |---Uganda
    |---Catégorie:Histoire_de_l'Ouganda
    |---Category:History_of_Uganda
    |---Jamii:Historia_ya_Uganda
 |---USA
    |---Catégorie:Histoire_des_États-Unis
    |---Category:History_of_the_United_States
    |---Jamii:Historia_ya_Marekani
 |---France
    |---Catégorie:Histoire_de_France
    |---Category:History_of_France
    |---Jamii:Historia_ya_Ufaransa
Economy
 |---Côte d'Ivoire
    |---Catégorie:Économie_de_la_Côte_d'Ivoire
    |---Category:Economy_of_Ivory_Coast
    |---Jamii:Uchumi_wa_Cote_d'Ivoire
 |---USA
    |---Catégorie:Économie_des_États-Unis
    |---Category:Economy_of_the_United_States
    |---Jamii:Uchumi_wa_Marekani
 |---Uganda
    |---Catégorie:Économie_de_l'Ouganda
    |---Category:Economy_of_Uganda
    |---Jamii:Uchumi_wa_Uganda
 |---France
    |---Catégorie:Économie_de_la_France
    |---Category:Economy_of_France
    |---Jamii:Uchumi wa Ufaransa

Looking at every title of every article in every subcategory of these categories, a human (your author) accepted or rejected each article. The purpose of this filtering process was to remove noise, such as articles about economists from Economy or high schools from Geography, while still being able to select articles from many diverse categories. In the end we accepted a total of:

In [271]:
ethnosets = !ls ethnosets/
reduce(lambda a, b: a+b, map(lambda l: len(l), map(lambda f: json.load(open('ethnosets/'+f)), ethnosets)))
Out[271]:
24630

That is 24,630 articles, which were fetched as live data from Wikipedia; live so that we could compare the results after any editing efforts. We used Wikimedia Labs for the data pull script, if you are interested. Then for each of those articles we ask Wikidata to give us the page, if possible, in all our desired languages. On all those pages we then compute the Actionable Metrics as described earlier.

If you are keeping track that means we have four active variables:

1. the Nation,
2. the Wikipedia language,
3. the subject, and
4. the metric type.

Below is a matrix of heatmaps displaying the mean (not the full distribution, as we previously graphed) of the specified metrics versus specified subjects. In the inner dimensions we graph across all pages in a specified language versus a specified nation. That is, any individual heatmap compares Wikipedia languages against nations. Looking down the rows of the greater matrix we have the varying metric types, and looking across the columns of the greater matrix we have the subject categories.

In [146]:
make_heat_map()

### Analysis

This heatmap can act as a guide to where one could find good examples of articles. Say, for instance, you wanted to improve History articles of Uganda. First look at the vertical History column and, within it, the vertical Uganda column. Then, moving your eyes up and down, you can see that while English Wikipedia has the best reference rate, French Wikipedia has a slightly higher code & images score. Each would be useful in a different way, depending on how you wanted to improve these articles.

For Economy, we can see that the clearest horizontal band across each metric is English. This is a promising result, because we knew that English, unlike French or Swahili, placed more of an emphasis on Economy in its country articles. However, if we take a by-nation approach we see that the bluest square is almost always in the France column, not in the USA column. Yet wouldn’t we expect that if English values Economy, Economy of the USA articles would perform most strongly? One explanation is that there are so many English articles about USA companies that the average quality is in fact brought down. Some data that lends credence to this idea is that English articles about France’s economy outperform in the code and images metric. Bot operations often leave a lot of templated code on many relatively obscure articles, and this could be a trace of that. For Economy articles, then, it would be recommended to look at French articles about France for written quality, and at English articles for templating ideas.

Looking in the History column, in each of the metric rows there is a strong horizontal band of blue in the French sub-row. This indicates that French history articles are good across most nations. The exceptions are the bottom two metrics, article length and references per article length, where English seems to outperform, particularly for Uganda and the USA. Swahili here performs better for the USA than for Uganda. In fact, in the Swahili band many data points do not register because there are fewer than 25 articles in question (see caveat below). Within the Uganda column, however, English and French do register, and we can interpret these as translation opportunities to get more Swahili content. As for Côte d’Ivoire, we also find that in all but referencing French does better than English. This would suggest that for Côte d’Ivoire’s history there are unfortunately no better direct examples to follow (although there is French history to look up to).

Moving to Geography, a notable pattern is the domination of French Wikipedia about France in, again, all but the referencing metric. We also encounter some movement from Swahili Wikipedia for the first time. You can see two fainter, but still noticeable, vertical strips in Côte d’Ivoire and Uganda for all Wikipedia languages. This means that Geography is actually the relative strong point for these African countries. To improve their relative content coverage, Swahili editors would do better to focus on Economy and History if they want to work on the more-needed areas of their Wikipedia. Within Geography, the bluest points are in code and images, which is likely the result of bot-created articles about African places. The places where those articles most need improvement are section count and article length, for which English and French articles about France, and English articles about the USA, could best serve as a guide.

#### Caveats

Since we are using the mean, if there are not many pages in a language for a category and one of those pages is particularly impressive, the average can appear high without that language being a useful information source. This occurred, for instance, for Swahili Economy of France in Wikilinks. To reduce this noise we set a minimum threshold of 25 contributing articles per heatmap square, so a white square appears wherever there were fewer than 25 articles. All subject-nations not meeting this threshold can be found in the code appendix.
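
The effect of the threshold can be shown with a minimal Python 3 sketch. The helper name `cell_mean` and the sample values are made up; only the cut-off of 25 comes from the text.

```python
from statistics import mean

MIN_ARTICLES = 25  # threshold stated in the text

def cell_mean(values, min_articles=MIN_ARTICLES):
    """Mean of one heatmap cell, or None (a white square) if too few articles."""
    if len(values) < min_articles:
        return None
    return mean(values)

sparse = [0.0] * 9 + [50.0]   # 10 pages, one of them an impressive outlier
dense = [1.0] * 24 + [26.0]   # 25 pages, enough to be reported

sparse_cell = cell_mean(sparse)  # suppressed: fewer than 25 pages
dense_cell = cell_mean(dense)    # reported as an ordinary mean
```

Without the threshold, the sparse cell would report a mean of 5.0 driven entirely by a single page, which is exactly the kind of misleadingly blue square the cut-off is meant to hide.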

### Top Templates by Subject and Nation

The previous approach of looking at the top section headings was not very useful here, because within a broad category like “Economy” or “Geography” there are too many sub-subjects with different expected flavours of section headings. The code appendix shows the results of this experimentation. Instead we look at the top templates, to see if specific quality flaws could be identified using the techniques of Stein’s Quality Flaws. Regrettably this study only discusses English Wikipedia, so here we would need a French or Kiswahili Wikipedian to identify the top clean-up tags in their respective Wikipedias. While “Citation needed” does appear for some categories, neither “Citation nécessaire” nor “Ukweli” crops up, which indicates that a direct translation may not be possible.

In fact, none of the most common clean-up tags makes an appearance in our top 20 for each subject-nation file, except “Citation needed”. These subject-nation combinations have the highest incidence of “Citation needed”:

| Subject-nation | Frequency of "citation needed" (per article) |
|---|---|
| history-usa | 0.501708 |
| economy-fra | 0.282183 |
| economy-usa | 0.270058 |
| history-fra | 0.226929 |
| history-cdi | 0.173913 |
| geography-usa | 0.154013 |
| economy-cdi | 0.128205 |

This probably highlights a lack of tagging with clean-up templates more than anything. It seems counter-intuitive that the more highly scoring subject-nations would also need more citations than those which scored lower in all other areas. Our sample of Wikipedia articles here is either too small for this clean-up tag analysis or otherwise not the right application for it.

Still, looking over the data one can see that other citation templates feature highly. Therefore we can investigate the relative types of Citations used. We do this only in English Wikipedia for lack of understanding of French and Swahili citation philosophies.

First we retrieve the statistics for all occurrences of English Wikipedia citation templates from Wikimedia Labs, and then convert them into proportions.

MariaDB [enwiki_p]> select tl_title, count(*) from templatelinks where tl_title like 'Cite_web' or tl_title like 'Cite_news' or tl_title like 'Cite_journal' or tl_title like 'Cite_book' group by tl_title;
+--------------+----------+
| tl_title     | count(*) |
+--------------+----------+
| Cite_book    |   536258 |
| Cite_journal |   328129 |
| Cite_news    |   444447 |
| Cite_web     |  1560207 |
+--------------+----------+
4 rows in set (11 min 36.68 sec)
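
The conversion from these counts to proportions is simple arithmetic. The Python 3 sketch below uses only the four counts from the query result above; the variable names are illustrative.

```python
# Counts copied from the query result above.
counts = {
    "Cite_book": 536258,
    "Cite_journal": 328129,
    "Cite_news": 444447,
    "Cite_web": 1560207,
}

total = sum(counts.values())
proportions = {name: n / total for name, n in counts.items()}

# Print the templates from most to least used.
for name, share in sorted(proportions.items(), key=lambda kv: -kv[1]):
    print(f"{name:<13}{share:.1%}")
```

Cite_web comes out at roughly 54% of the four templates combined, which is the global benchmark against which the per-subject, per-nation compositions can be read.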

This global statistic serves as a benchmark for the compositions we can infer from our top-template work, if we consider all the occurrences of Cite book, Cite journal, Cite news, and Cite web in our subject-nation files. See the graphical representation below.

In [125]:
make_cite_plot()

Immediately, Geography seems to be less diversely cited for all nations. We can also see signs that Economy on the whole has more news citations, and that History on average uses more book citations. The more important pattern, however, is that Web citations almost always make up more than half. Notably, of the three cases where they do not, two are heavy in news citations: History of Côte d’Ivoire and History of Uganda.

All of this should be quite reassuring. A main disadvantage facing African content, previously postulated, was the lack of citable sources. That problem is exacerbated if we consider only printed historical works as useful for expanding content. However, since citations across the board are not heavily reliant on books or journals, even in the Encyclopedia’s strong suits like the Economy of the USA, citing should be less of an impediment to Kumusha Takes Wiki editors.

## Conclusions

Our main task was to understand what makes a good article with respect to a nation. Despite the question’s subjectivity, we have shown some facts about the current state of Wikipedia, and if you consider any of them “useful” then you can try to emulate them. Our main finding is that when it comes to encyclopedically describing a nation, the English, French and Swahili Wikipedias all write most often about the nation’s history. French and Swahili then consider Geography next most important, while English favours Economy. We have created a graphical guide involving our 3 languages, 3 subjects, 4 nations, and 5 metrics. One can consult this guide to see where work needs to be done, and which areas of our Wikipedias to look to as an example. Lastly we determined that we do not have much information about where users have left requests for improvement to the countries in question, but if English Wikipedia is a model, then Web citations are usually sufficient.

### Further Directions

The way in which the swaths of articles were chosen for the subject-nation sets could be unsatisfactory for several reasons. Firstly, the category system may not accurately represent a Wikipedia’s available items on a subject. Secondly, since the process involved a human judge, error is certainly introduced. A better way of determining these subject-nation sets would be useful. Also, no by-hand human reading of those sets was done; instead we opted for algorithmic methods. If a sound methodology for the human analysis of pages is available, that would be a good technique to compare to the algorithmic ones presented here.

## Start of Supporting Code

In [111]:
#Infonoise metric of Stvilia (2005) in concept, although the implementation may differ since we are not stopping and stemming words, because of the multiple languages we need to handle

#could also use wikicode.filter_text()
return float(len(wikicode.strip_code()))

def infonoise(wikicode):
wikicode.strip_code()
return ratio

#Helper function to mine for section headings, of course if there is a lead it doesn't quite make sense.

sections = wikicode.get_sections()
sec_headings = map( lambda s: filter( lambda l: l != '=', s), map(lambda a: a.split(sep='\n', maxsplit=1)[0], sections))

#i don't know why mwparserfromhell's .filter_tags() isn't working at the moment. going to hack it for now
import re
def num_refs(wikicode):
    #count opening <ref ...> tags with a regex (closing </ref> tags do not match)
    text = str(wikicode)
    reftags = re.findall(r'<(\ )*?ref', text)
    return len(reftags)

def article_refs(wikicode):
    sections = wikicode.get_sections()
    #sum() also copes with a page that has no sections, where reduce() without an initial value would raise
    return float(sum(map(num_refs, sections)))

#Predicate for links and files in English French and Swahili

fnames = [u'File:', u'Fichier:', u'Image:', u'Picha:']
bracknames = map(lambda a: '[[' + a, fnames)

cnames =[u'Category:', u'Catégorie:', u'Jamii:']
bracknames = map(lambda a: '[[' + a, cnames)

return float(len(file_links))
In [2]:
import pywikibot
import mwparserfromhell as pfh
import os
import datetime
import pandas as pd
import json
from collections import defaultdict
from ggplot import *
import operator

%pylab inline

langs = ['en','fr','sw']
nations = ['usa', 'fra', 'cdi', 'uga']

wikipedias = {lang: pywikibot.Site(lang, 'wikipedia') for lang in langs}
wikidata = wikipedias['fr'].data_repository()
VERBOSE:pywiki:Starting 1 threads...
Populating the interactive namespace from numpy and matplotlib
WARNING: pylab import has clobbered these variables: ['xlim', 'mpl', 'colors', 'ylim']
%pylab --no-import-all prevents importing * from pylab and numpy

This will be our data structure, a dict of dicts, until we have some numbers to put into pandas.

In [143]:
def enfrsw():
    return {lang: None for lang in langs}

def article_attributes():
    return {attrib: enfrsw() for attrib in ['sitelinks', 'wikitext', 'wikicode', 'metrics']}

for qid in qids:
page = pywikibot.ItemPage(wikidata, qid)
wditem = page.get()
for lang in langs:
try:
except KeyError:
pass
return data

Functions to get the page texts in all our desired languages

In [494]:
def get_wikitext(lang, title):
    page = pywikibot.Page(wikipedias[lang],title)
    def get_page(page):
        try:
            pagetext = page.get()
            return pagetext
        except pywikibot.exceptions.IsRedirectPage:
            #follow the redirect and return the text of its target
            redir = page.getRedirectTarget()
            return get_page(redir)
        except pywikibot.exceptions.NoPage:
            print 're raising'
            raise
    return get_page(page)

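def do_wikitext(langs, data):
    for qid, attribs in data.iteritems():
        #lang and sl (the sitelink title) come from a loop over the item's sitelinks that is not shown here
        if sl:
            try:
                #print roughly one title in a hundred as a progress indicator
                if randint(0,100) == 99:
                    print sl
                data[qid]['wikitext'][lang] = get_wikitext(lang, sl)
            except:
                continue
    return data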

def do_wikicode(langs, data):
    for qid, attribs in data.iteritems():
        for lang, pagetext in attribs['wikitext'].iteritems():
            if pagetext:
                data[qid]['wikicode'][lang] = pfh.parse(pagetext)
    return data

def do_metrics(data):
    for qid, attribs in data.iteritems():
        for lang, wikicode in attribs['wikicode'].iteritems():
            if wikicode:
                data[qid]['metrics'][lang] = report_actionable_metrics(wikicode)
    return data

# this will take a lot of network time since we are going to load about 300 pages, but we'll save the data off so we don't have to do it unnecessarily

def make_data(langs, qids, savename):
    print 'getting these qids: ', qids
    data = defaultdict(article_attributes)
    print 'getting wikitext'
    data = do_wikitext(langs, data)
    print 'converting to wikicode'
    data = do_wikicode(langs, data)
    print 'computing metrics'
    data = do_metrics(data)

    #keep only attributes that can be written to JSON (the parsed wikicode objects are left out)
    hashable_data = {qid:
                        {'wikitext': attribdict['wikitext'],
                         'metrics': attribdict['metrics']}
                     for qid, attribdict in data.iteritems()}
    print 'saving now'
    #save the results
    safefilename = savename+str(datetime.datetime.now())+'.json'
    with open(safefilename,'w') as f3:
        json.dump(hashable_data,f3)
    with open(savename+'latest.json','w') as f4:
        json.dump(hashable_data, f4)
    return data

#i don't call this unless i have time to uncomment it
#arts = make_data(langs, country_qids, 'countrydata')

#time to get into pandas, lets throw everything into a data frame

df = pd.DataFrame(columns=['Country','language','metric','val'])

for qid, attribdict in arts.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'metrics':
            for lang, metrics in langdict.iteritems():
                try:
                    #sometimes there wasn't an article in that language and thus no corresponding len
                    for metric_name, metric_val in metrics.iteritems():
                        df = df.append({'Country': qid, 'language':lang, 'metric':metric_name, 'val':float(metric_val)}, ignore_index=True)
                except:
                    pass
df = df.convert_objects(convert_numeric=True)

langs_df_dict = {lang: df[df['language'] == lang] for lang in langs}
metric_df_dict = {metric: df[df['metric'] == metric] for metric in metric_list}

#for later calculation
uganda_zscores = defaultdict(list)
cdi_zscores = defaultdict(list)
In [495]:
def metric_analyse_density(ametric, xlimit):
    inf_df = metric_df_dict[ametric]

    #standardise each language column: (value - column mean) / column standard deviation
    zscore = lambda x: (x - x.mean()) / x.std()

    inf_piv = inf_df.pivot(index='Country', columns='language', values='val')

    inf_piv_z = inf_piv.apply(zscore)
    metric_analyse_density_plot(ametric, xlimit, inf_df)
    print 'Uganda ('+ugandaqid+"), Côte d'Ivoire ("+cdiqid+") " +ametric+ " z-scores."
    return inf_piv_z.ix[[ugandaqid,cdiqid]]
In [496]:
def metric_analyse_density_plot(ametric, xlimit, inf_df):
    p = ggplot(aes(x='val', colour='language', fill=True, alpha = 0.3), data=inf_df) + geom_density() + labs("score", "frequency") + \
        scale_x_continuous(limits=(0,xlimit)) + ggtitle(ametric + '\nall country articles\n                                                                        ')
    p.rcParams["figure.figsize"] = "4, 3"
    p.draw()
In [112]:
def defaultint():
    return defaultdict(int)

section_count = defaultdict(defaultint)
sorted_secs = defaultdict(list)
total_articles = defaultdict(int)

for qid, attribdict in articles.iteritems():
    for attribname, langdict in attribdict.iteritems():
        if attribname == 'wikitext':
            for lang, wikitext in langdict.iteritems():
                if wikitext:
                    total_articles[lang] += 1
                    wikicode = pfh.parse(wikitext)
                    #secs holds this article's section headings (the line extracting them is not shown here)
                    for sec in secs:
                        sec = sec.strip()
                        section_count[lang][sec] += 1

section_df = pd.DataFrame(columns=['lang','secname','freq'])
for lang, sec_dict in section_count.iteritems():
    for secname, seccount in sec_dict.iteritems():
        freq = seccount/float(total_articles[lang])
        section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)

#section_df = section_df.convert_objects(convert_numeric=True)
top_secs = section_df[section_df.freq > 0.1]

sort_secs= top_secs.sort(columns='freq', ascending=False)

def top_n_sections(lang,n):
    return sort_secs[sort_secs.lang==lang].iloc[:n].convert_objects(convert_numeric=True)
In [9]:
def top_sections_ethnoset(ethnoset_filename):

    def defaultint():
        return defaultdict(int)

    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)

    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems():
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        #secs holds this article's section headings (the line extracting them is not shown here)
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1

    section_df = pd.DataFrame(columns=['lang','secname','freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount/float(total_articles[lang])
            section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)

    #section_df = section_df.convert_objects(convert_numeric=True)
    top_secs = section_df[section_df.freq > 0.1]

    sort_secs= top_secs.sort(columns='freq', ascending=False)

    return sort_secs
In [137]:
def make_heat_map():

    subj_list = ['economy','history','geography']
    #pivtables = {metric: {subj: None for subj in subj_list} for metric in metric_list}

    #one row of subplots per metric, one column per subject
    fig, axes = plt.subplots(nrows = len(metric_list), ncols = len(subj_list), sharex='col', sharey='row' )
    '''
    for metric, subjdict in pivtables.iteritems():
        for subj, pivtab in subjdict.iteritems():
            natlangdf = means_df[(means_df.metric == metric) & (means_df.subj == subj)]
            natlangpiv = pd.pivot_table(natlangdf, values='means', rows='lang', cols='nation')
            pivtables[metric][subj] = natlangpiv

    '''

    for axarr, metric in zip(axes, metric_list):
        for ax, subj in zip(axarr, subj_list):

            #each small heatmap has languages as rows and nations as columns
            natlangdf = means_df[(means_df.metric == metric) & (means_df.subj == subj)]
            natlangpiv = pd.pivot_table(natlangdf, values='means', rows='lang', cols='nation')
            heatmap = ax.pcolor(natlangpiv, cmap='Blues')
            ax.set_yticks(np.arange(0.5, len(natlangpiv.index), 1))
            ax.set_yticklabels(natlangpiv.index)
            ax.set_xticks(np.arange(0.5, len(natlangpiv.columns), 1))
            ax.set_xticklabels(natlangpiv.columns)
            cbar = plt.colorbar(mappable=heatmap, ax=ax)

    fig.suptitle('Heatmap of Actionable Metrics by Country versus Wikipedia Language, \n by Subject Category', fontsize=18)
    fig.set_size_inches(12,12,dpi=600)
    #fig.tight_layout()

    subj_titles = ['Economy','History','Geography']
    metric_titles =['Wikilinks','Code & Images to Text Ratio','Section Count','Article Length', 'References per Article Length']
    for i in range(len(subj_titles)):
        axes[0][i].set_title(subj_titles[i])
    for j in range(len(metric_titles)):
        axes[j][0].set_ylabel(metric_titles[j])
In [82]:
means_df[(means_df.metric == 'referencerate') & (means_df.subj == 'geography')]
Out[82]:
subj nation lang metric means
48 geography usa en referencerate 0.003102
49 geography usa fr referencerate 0.001050
50 geography usa sw referencerate 0.000012
51 geography fra en referencerate 0.000573
52 geography fra fr referencerate 0.001929
53 geography fra sw referencerate 0.000000
54 geography cdi en referencerate 0.004123
55 geography cdi fr referencerate 0.002138
56 geography cdi sw referencerate 0.006302
57 geography uga en referencerate 0.001590
58 geography uga fr referencerate 0.000560
59 geography uga sw referencerate 0.000114
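
The heatmaps are drawn from a language-by-nation grid built out of long-format rows like the ones above. As a rough, pandas-free sketch of that reshaping step, the rows from `Out[82]` can be pivoted by hand. The function name `pivot_lang_by_nation` is hypothetical, and sorting the labels alphabetically is a choice made for this sketch.

```python
# Rows copied from Out[82]: (nation, lang, mean referencerate) for the geography subject
rows = [
    ('usa', 'en', 0.003102), ('usa', 'fr', 0.001050), ('usa', 'sw', 0.000012),
    ('fra', 'en', 0.000573), ('fra', 'fr', 0.001929), ('fra', 'sw', 0.000000),
    ('cdi', 'en', 0.004123), ('cdi', 'fr', 0.002138), ('cdi', 'sw', 0.006302),
    ('uga', 'en', 0.001590), ('uga', 'fr', 0.000560), ('uga', 'sw', 0.000114),
]

def pivot_lang_by_nation(rows):
    """Return (langs, nations, grid) where grid[i][j] is the mean for langs[i], nations[j]."""
    langs = sorted(set(lang for _, lang, _ in rows))
    nations = sorted(set(nation for nation, _, _ in rows))
    lookup = dict(((lang, nation), mean) for nation, lang, mean in rows)
    # Missing (lang, nation) pairs become NaN so that the grid stays rectangular
    grid = [[lookup.get((lang, nation), float('nan')) for nation in nations]
            for lang in langs]
    return langs, nations, grid

langs, nations, grid = pivot_lang_by_nation(rows)
```

Each inner list of `grid` corresponds to one row of a heatmap panel, with `langs` and `nations` supplying the tick labels.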
In [135]:
import json

def load_ethnosaves():
    ethnosaves = !ls ethnosave

    # One long-format DataFrame per saved file, keyed by file name
    subj_df_dict = {subj: pd.DataFrame(columns=['qid','subj','nation','lang','metric','val']) for subj in ethnosaves}

    for ethnosavefile in ethnosaves:
        # File names look like 'economy-cdi.jsonlatest.json'
        nameparts = ethnosavefile.split('-')
        subj = nameparts[0]
        dotparts = nameparts[1].split('.')
        nation = dotparts[0]
        print subj, nation
        # Assumption: each saved file is JSON keyed by Wikidata qid; the loading step is reconstructed here
        arts = json.load(open('ethnosave/' + ethnosavefile))
        sdf = subj_df_dict[ethnosavefile]
        for qid, attribdict in arts.iteritems():
            for attribname, langdict in attribdict.iteritems():
                if attribname == 'metrics':
                    for lang, metrics in langdict.iteritems():
                        try:
                            # sometimes there wasn't an article in that language and thus no corresponding metrics
                            for metric_name, metric_val in metrics.iteritems():
                                sdf = sdf.append({'qid': qid, 'subj':subj, 'nation':nation, 'lang':lang, 'metric':metric_name, 'val':float(metric_val)}, ignore_index=True)
                        except (AttributeError, TypeError, ValueError):
                            pass
        subj_df_dict[ethnosavefile] = sdf
        # Running row count per file, printed after each file is loaded
        lens = map(lambda d: len(d), subj_df_dict.itervalues())
        print lens
    return subj_df_dict

subj_df_dict = load_ethnosaves()
subj_df = pd.concat(subj_df_dict)
assert(len(subj_df) == reduce(lambda a, b: a+b, map(lambda df: len(df), subj_df_dict.itervalues())))

subj_df = subj_df.convert_objects()
economy cdi
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 735]
economy fra
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14690, 735]
economy uga
[0, 0, 0, 495, 0, 0, 0, 0, 0, 0, 14690, 735]
economy usa
[0, 0, 0, 495, 0, 0, 0, 0, 0, 18550, 14690, 735]
geography cdi
[12495, 0, 0, 495, 0, 0, 0, 0, 0, 18550, 14690, 735]
geography fra
[12495, 0, 0, 495, 0, 26550, 0, 0, 0, 18550, 14690, 735]
geography uga
[12495, 0, 0, 495, 0, 26550, 0, 0, 5020, 18550, 14690, 735]
geography usa
[12495, 0, 43992, 495, 0, 26550, 0, 0, 5020, 18550, 14690, 735]
history cdi
[12495, 0, 43992, 495, 0, 26550, 995, 0, 5020, 18550, 14690, 735]
history fra
[12495, 0, 43992, 495, 0, 26550, 995, 22325, 5020, 18550, 14690, 735]
history uga
[12495, 0, 43992, 495, 2165, 26550, 995, 22325, 5020, 18550, 14690, 735]
history usa
[12495, 11340, 43992, 495, 2165, 26550, 995, 22325, 5020, 18550, 14690, 735]
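
The subject and nation printed above are recovered from the file names alone. A minimal sketch of that parsing step, assuming every file follows the `subject-nation.jsonlatest.json` pattern seen in this notebook (the helper name `parse_ethnosave_name` is hypothetical):

```python
def parse_ethnosave_name(filename):
    """Split e.g. 'ethnosave/economy-cdi.jsonlatest.json' into ('economy', 'cdi')."""
    basename = filename.split('/')[-1]      # drop any directory prefix
    subj, rest = basename.split('-', 1)     # subject comes before the first hyphen
    nation = rest.split('.', 1)[0]          # nation code comes before the first dot
    return subj, nation

parsed = parse_ethnosave_name('ethnosave/economy-cdi.jsonlatest.json')
```

Splitting once on each delimiter keeps the helper from breaking if a later part of the name happens to contain another hyphen or dot.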
In [145]:
means_df = pd.DataFrame(columns=['subj','nation','lang','metric','means'])

for subj in ['geography','history','economy']:
    # The metric loop is restored from the order of the printed output (subject, metric, nation, language);
    # metric_list is assumed to be the same global that make_heat_map uses.
    for metric in metric_list:
        for nation in ['usa','fra','cdi','uga']:
            for lang in ['en','fr','sw']:
                spec_df = subj_df[(subj_df.subj == subj) & (subj_df.nation == nation) & (subj_df.metric == metric) & (subj_df.lang == lang)]['val']
                mean = spec_df.mean()
                # An empty selection has a NaN mean; treat it as zero
                if pd.isnull(mean):
                    mean = 0.0
                # Too few articles to be meaningful: report the group and zero it out
                if len(spec_df) <= 25:
                    print len(spec_df), subj, metric, nation, lang
                    mean = 0.0
                means_df = means_df.append({'subj':subj, 'nation':nation, 'lang':lang, 'metric':metric, 'means':mean}, ignore_index=True)

means_df = means_df.convert_objects(convert_numeric=True)
6 geography completeness fra sw
6 geography informativeness fra sw
6 geography articlelength fra sw
6 geography referencerate fra sw
8 history completeness fra sw
0 history completeness cdi sw
3 history completeness uga sw
8 history informativeness fra sw
0 history informativeness cdi sw
3 history informativeness uga sw
8 history articlelength fra sw
0 history articlelength cdi sw
3 history articlelength uga sw
8 history referencerate fra sw
0 history referencerate cdi sw
3 history referencerate uga sw
10 economy completeness usa sw
5 economy completeness fra sw
1 economy completeness cdi sw
10 economy completeness uga fr
11 economy completeness uga sw
10 economy informativeness usa sw
5 economy informativeness fra sw
1 economy informativeness cdi sw
10 economy informativeness uga fr
11 economy informativeness uga sw
10 economy articlelength usa sw
5 economy articlelength fra sw
1 economy articlelength cdi sw
10 economy articlelength uga fr
11 economy articlelength uga sw
10 economy referencerate usa sw
5 economy referencerate fra sw
1 economy referencerate cdi sw
10 economy referencerate uga fr
11 economy referencerate uga sw
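
Every group listed above had 25 or fewer articles, so its mean was replaced with zero. A small standalone sketch of that guard, using plain lists instead of a DataFrame (the name `guarded_mean` is hypothetical):

```python
import math

MIN_SAMPLES = 25  # same cut-off as `len(spec_df) <= 25` in the cell above

def guarded_mean(values, min_samples=MIN_SAMPLES):
    """Mean of the non-NaN values, or 0.0 when the group is too small or empty."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if len(values) <= min_samples or not clean:
        return 0.0
    return sum(clean) / float(len(clean))

small_group = guarded_mean([0.5] * 10)   # only 10 samples, so it is zeroed
large_group = guarded_mean([2.0] * 26)   # 26 samples, so the mean is kept
```

Zeroing small groups keeps a handful of unusual articles from dominating a heatmap cell, at the cost of making "no data" indistinguishable from a true zero.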
In [131]:
import json

def top_sections_ethnoset(ethnoset_filename):
    print ethnoset_filename
    # Assumption: the ethnoset file is JSON keyed by Wikidata qid; the loading step is reconstructed here
    articles = json.load(open(ethnoset_filename))

    def defaultint():
        return defaultdict(int)

    section_count = defaultdict(defaultint)
    sorted_secs = defaultdict(list)
    total_articles = defaultdict(int)

    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems():
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        # Assumption: section names are taken from the parsed headings
                        secs = [heading.title for heading in wikicode.filter_headings()]
                        for sec in secs:
                            sec = sec.strip()
                            section_count[lang][sec] += 1

    # Frequency = occurrences of a section name per article in that language
    section_df = pd.DataFrame(columns=['lang','secname','freq'])
    for lang, sec_dict in section_count.iteritems():
        for secname, seccount in sec_dict.iteritems():
            freq = seccount/float(total_articles[lang])
            section_df = section_df.append({'lang':lang, 'secname':secname, 'freq':freq}, ignore_index=True)

    #section_df = section_df.convert_objects(convert_numeric=True)
    top_secs = section_df[section_df.freq > 0.1]

    sort_secs = top_secs.sort(columns='freq', ascending=False)
    return sort_secs
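
The counting and thresholding in `top_sections_ethnoset` can be illustrated without pandas or a wikitext parser. The sketch below assumes the section names have already been extracted into plain lists, one list per article; the function name `section_frequencies` and the toy input are hypothetical.

```python
from collections import Counter

def section_frequencies(sections_by_lang, threshold=0.1):
    """Map each language to (section name, frequency) pairs above the threshold, most frequent first."""
    result = {}
    for lang, articles in sections_by_lang.items():
        if not articles:
            result[lang] = []
            continue
        counts = Counter()
        for section_names in articles:
            counts.update(name.strip() for name in section_names)
        total = float(len(articles))
        # Frequency is occurrences per article, as in the notebook cell above
        freqs = [(name, n / total) for name, n in counts.items() if n / total > threshold]
        result[lang] = sorted(freqs, key=lambda pair: pair[1], reverse=True)
    return result

toy = {'en': [['History', 'Economy'], ['History'], ['History ', 'See also']], 'sw': []}
top = section_frequencies(toy, threshold=0.5)
```

With the 0.5 threshold only `History`, which appears in all three toy articles, survives for `en`.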
In [132]:
import json

def top_templates_ethnoset(ethnoset_filename):
    # Assumption: the ethnoset file is JSON keyed by Wikidata qid; the loading step is reconstructed here
    articles = json.load(open(ethnoset_filename))

    def defaultint():
        return defaultdict(int)

    template_count = defaultdict(defaultint)
    sorted_templates = defaultdict(list)
    total_articles = defaultdict(int)

    for qid, attribdict in articles.iteritems():
        for attribname, langdict in attribdict.iteritems():
            if attribname == 'wikitext':
                for lang, wikitext in langdict.iteritems():
                    if wikitext:
                        total_articles[lang] += 1
                        wikicode = pfh.parse(wikitext)
                        temps = wikicode.filter_templates()
                        for temp in temps:
                            # Normalise names so that 'Cite web' and 'cite web' are counted together
                            tempname = temp.name
                            tempname = tempname.strip().lower()
                            template_count[lang][tempname] += 1

    # Frequency = occurrences of a template per article in that language
    temp_df = pd.DataFrame(columns=['lang','tempname','freq'])
    for lang, temp_dict in template_count.iteritems():
        for tempname, tempcount in temp_dict.iteritems():
            freq = tempcount/float(total_articles[lang])
            temp_df = temp_df.append({'lang':lang, 'tempname':tempname, 'freq':freq}, ignore_index=True)

    #section_df = section_df.convert_objects(convert_numeric=True)
    top_templates = temp_df[temp_df.freq > 0.1]

    sort_temps = top_templates.sort(columns='freq', ascending=False)
    # Keep at most the 20 most frequent templates per language
    temps_dict = dict()
    for lang in template_count.iterkeys():
        try:
            temps_dict[lang] = sort_temps[sort_temps.lang==lang].iloc[:20].convert_objects(convert_numeric=True)
        except:
            temps_dict[lang] = sort_temps[sort_temps.lang==lang].convert_objects(convert_numeric=True)

    return temps_dict
In [384]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/'+name, ethnosaves)
sort_dfs = map(top_sections_ethnoset, filenames)
ethnosave/economy-cdi.jsonlatest.json
ethnosave/economy-fra.jsonlatest.json
ethnosave/economy-uga.jsonlatest.json
ethnosave/economy-usa.jsonlatest.json
ethnosave/geography-cdi.jsonlatest.json
ethnosave/geography-fra.jsonlatest.json
ethnosave/geography-uga.jsonlatest.json
ethnosave/geography-usa.jsonlatest.json
ethnosave/history-cdi.jsonlatest.json
ethnosave/history-fra.jsonlatest.json
ethnosave/history-uga.jsonlatest.json
ethnosave/history-usa.jsonlatest.json
In [124]:
def make_cite_plot():
    citedf = pd.DataFrame(columns=['setname','cite','freq'])
    for i in range(len(filenames)):
        for lang, df in temp_dfs[i].iteritems():
            if lang == 'en':
                df = df[(df.tempname == 'cite web') | (df.tempname == 'cite book') | (df.tempname == 'cite news') | (df.tempname == 'cite journal')]
                # 'ethnosave/economy-cdi.jsonlatest.json' -> 'economy-cdi'
                setname = filenames[i][10:-16]
                tot = 0
                for row in df.iterrows():
                    cols = row[1]
                    tot += cols['freq']
                for row in df.iterrows():
                    cols = row[1]
                    citedf = citedf.append({'setname':setname, 'cite': cols['tempname'], 'freq':cols['freq']/float(tot)}, ignore_index=True)

    # Template usage counts across all of English Wikipedia, used as a baseline
    cite_dict = {"cite book":536258, "cite journal":328129, "cite news":444447, "cite web":1560207}
    globaltot = reduce(lambda a,b: a+b, cite_dict.itervalues())
    globaltotfloat = float(globaltot)
    globciteratio = map(lambda cd: (cd[0], cd[1]/globaltotfloat), cite_dict.iteritems() )

    for cite in globciteratio:
        citetype, freq = cite[0], cite[1]
        citedf = citedf.append({'setname':'English WP Global', 'cite': citetype, 'freq':freq}, ignore_index=True)

    citedf = citedf.convert_objects(convert_numeric=True)
    citepiv = citedf.pivot(index = 'setname', columns = 'cite')
    citeplot = citepiv.plot(kind='bar', stacked=True)
    citeplot.legend(('Citation type', 'Cite book', 'Cite journal', 'Cite news', 'Cite web'), loc=9)
    citeplot.figure.set_size_inches(12,8)
    citeplot.set_xlabel('subject-nation')
    citeplot.set_title('Composition of Citation Type, by Subject-Nation')
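
The "English WP Global" bar in this plot is just the four global template counts normalised to sum to one. A short sketch of that normalisation, reusing the counts from `cite_dict` (the helper name `shares` is hypothetical):

```python
# Global usage counts of the four citation templates, copied from cite_dict above
cite_dict = {"cite book": 536258, "cite journal": 328129,
             "cite news": 444447, "cite web": 1560207}

def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = float(sum(counts.values()))
    return dict((name, n / total) for name, n in counts.items())

global_shares = shares(cite_dict)
most_common = max(global_shares, key=global_shares.get)
```

The same normalisation is applied per subject-nation set, which is what makes the stacked bars comparable despite very different article counts.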

This is where I look at the template occurrences.

In [383]:
ethnosaves = !ls ethnosave
filenames = map(lambda name: 'ethnosave/'+name, ethnosaves)
temp_dfs = map(top_templates_ethnoset, filenames)
for i in range(len(filenames)):
    for lang, df in temp_dfs[i].iteritems():
        print ''
        #print filenames[i]
        #print df