
3 Ways To Access Wikidata Data: Python, Dumps, and Linked Data

The inaugural Wiki Research Hackathon went very well, and it affirmed that I feel best when I'm conducting Wiki Research. I was asked to give one of the day's tech talks, on accessing Wikidata data programmatically. Here is an outline of the talk.

Purpose:

We'll be viewing Wikidata as a dataset in its own right for research, not in its canonical use case of serving the various Wikipedias.

Native format:

Wikidata is a mostly standard MediaWiki instance, except that pages don't store wikitext; they store JSON blobs. (If you want to understand more about this abstraction, see ContentHandler.)

Structure of a Wikidata Item:

The main entry point of any Wikidata item is a JSON dictionary of this form:

{"labels": by-language dictionary,
 "descriptions": by-language dictionary,
 "aliases": by-language dictionary,
 "claims": list of properties and values,
 "sitelinks": by-site dictionary}
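
For a concrete look at that structure, here's a minimal sketch that fetches an item's JSON over the web (assuming the third-party requests library; Q42, Douglas Adams, is just an arbitrary example item):

import requests

# fetch the raw JSON blob for item Q42 via Special:EntityData
resp = requests.get('https://www.wikidata.org/wiki/Special:EntityData/Q42.json')
resp.raise_for_status()
entity = resp.json()['entities']['Q42']

print(sorted(entity.keys()))            # ['aliases', 'claims', 'descriptions', 'labels', 'sitelinks', ...]
print(entity['labels']['en']['value'])  # the English label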


3 Ways To Access Wikidata:

Whether you're more comfortable in object-oriented Python, parsing large text files, or munging linked data, there is something for you.

Using Pywikibot:

With pywikibot you get almost full support of the API.
New classes in the "core" branch:
class WikibasePage(Page):
class ItemPage(WikibasePage):
class PropertyPage(WikibasePage):
class Claim(PropertyPage):
Classic pywikibot pagegenerators work.
import pywikibot

# make a generator for all the pages that use a property
# (properties like P21, "sex or gender", live in PropertyPage objects)
en_wikipedia = pywikibot.Site('en', 'wikipedia')
wikidata = en_wikipedia.data_repository()
property_page = pywikibot.PropertyPage(wikidata, 'P21')
pages_with_property = property_page.getReferences()
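
A quick hedged sketch of consuming that generator (the slicing and claim-reading below are illustrative assumptions, not from the talk):

import itertools

# inspect the first few items that use P21 and print their P21 targets
for page in itertools.islice(pages_with_property, 3):
    item = pywikibot.ItemPage(wikidata, page.title())
    item.get()  # populates labels, claims, sitelinks, ...
    for claim in item.claims.get('P21', []):
        print(item.labels.get('en'), '->', claim.getTarget())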

Pywikibot example:

I've been harvesting Infobox Book across many languages and writing the corresponding properties to Wikidata: https://github.com/notconfusing/harvest_infobox_book.

Using wda

Update: WDA is deprecated and replaced by Wikidata Toolkit, which I explain how to use with code examples in this blog post.

WDA, WikiData Analytics, downloads the official dump and analyzes it offline. Cleverly, it uses nightly incremental dumps after an initial download of about 10GB. It's also written in Python, mainly by Markus Kroetzsch. After downloading, a parser writes a file called kb.txt, which stores plaintext triples, one per line, giving you something like this:

Q21 link {trwiki:İngiltere} .
Q21 link {hewiki:אנגליה} .
Q21 alias {en:ENG} .
Q21 alias {min:Inggirih} .
Q21 alias {sgs:England} .
Q21 P31 Q1763527 .
Q21 P47 Q22 .
Q21 P47 Q25 .
Q21 P41 {Flag of England.svg} .
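
Since kb.txt holds one triple per line, a tiny parser goes a long way. A minimal sketch (the whitespace-splitting convention is my assumption from the sample above):

from collections import Counter

def parse_kb(path):
    # split each line into subject, predicate, object, dropping the trailing " ."
    with open(path, encoding='utf-8') as f:
        for line in f:
            subj, pred, obj = line.rstrip('\n').rstrip(' .').split(' ', 2)
            yield subj, pred, obj

# e.g. count how many sitelinks each item has
sitelink_counts = Counter(s for s, p, o in parse_kb('kb.txt') if p == 'link')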

I used wda in my analysis of the most unique Wikipedias according to Wikidata.

Content Negotiation:

You can also access Wikidata as linked data. The URL pattern is:

https://wikidata.org/entity/<QID>.<format>

where your choices of format are

nt
rdf
ttl
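
As a hedged sketch (again assuming the third-party requests library), you can also let the server negotiate the format from an Accept header instead of a suffix:

import requests

# ask the entity URI for Turtle; the server redirects to Special:EntityData
resp = requests.get('https://www.wikidata.org/entity/Q42046',
                    headers={'Accept': 'text/turtle'})
resp.raise_for_status()
print(resp.text[:300])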

Content Negotiation Example

https://www.wikidata.org/wiki/Special:EntityData/Q42046.ttl

@prefix entity: <http://www.wikidata.org/entity/> .
@prefix wikibase: <http://www.wikidata.org/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix data: <http://www.wikidata.org/wiki/Special:EntityData/> .
@prefix cc: <http://creativecommons.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
entity:Q42046
 a wikibase:Item ;
 rdfs:label "鬣狗科"@zh, "Hienowate"@pl, "Hiena"@eu, "Hyaenidae"@es, "Hiëna"@af, "Dubuk"@ms, "Hiénafélék"@hu, "Fisi"@sw, "Hüäänlased"@et, "হায়েনা"@bn, "Hiena"@sq, "Hyaenidae"@br, "Ύαινα"@el
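
To actually work with that Turtle in Python, here's a minimal sketch using the third-party rdflib library (my tooling choice, not something from the talk):

from rdflib import Graph
from rdflib.namespace import RDFS

# parse the item's Turtle serialization straight from the URL
g = Graph()
g.parse('https://www.wikidata.org/wiki/Special:EntityData/Q42046.ttl', format='turtle')

# print every label with its language tag
for _, _, label in g.triples((None, RDFS.label, None)):
    print(label.language, label)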

Conclusion

So until Phase III arrives there are still some usable options for exploring Wikidata for research purposes. However, we can still dream of a future robust query system. In that dream I like to think of a query system capable of answering "does there exist a sequence of properties that connects these two Wikidata items?"
-Max

This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

