Monday, January 31, 2011

The What, Where and How of Open Data

Last week I attended a seminar at the Cathie Marsh Centre for Census and Survey Research, given by Rufus Pollock of the Open Knowledge Foundation (OKFN) on the topic of "open data".

Rufus started by showing two example applications built using open data. Yourtopia makes use of data from the World Bank that measures individual nations' progress towards the Millennium Development Goals. Visitors to the site balance the relative importance of different factors (for example, "health", "economy" and "education"), and their preferences are matched with the data in order to suggest which country meets them most closely. Where Does My Money Go? offers various breakdowns of UK government spending and presents these in a way that allows the site visitor to see (for example) how much of the tax they pay is used for things such as defence, environment, culture and so on.

Both sites are eye-catching and fun (and can provide some surprising insights), while at the same time serving more serious purposes. In the context of the seminar Rufus noted that building the two sites also highlighted some key issues when working with these kinds of datasets:
  • Completeness: i.e. the data are not always complete
  • Correctness: i.e. the data are not always correct
  • Ease-of-use: it can take a lot of effort to put the data into a format where it can actually be used (for example an estimated 90% of the time developing Where Does My Money Go?, as opposed to 10% actually building the site)
These issues can largely be mitigated by "open data", which has two key characteristics:
  • Legal openness: the data must be provided with a licence that allows anyone to use, reuse and redistribute the data, for any purpose. ("Reuse" in this context can include combining it with other datasets and redistributing the result.) An explicit open licence is required (such as those offered at Open Data Commons) because the default legal position for any data - even that posted "openly" on the web - doesn't entitle someone else to reuse or redistribute it.
  • Technical openness: the data should be in a format that means that it's easy to access and work with, that it should be possible to obtain the data in bulk, and in a machine-readable, open format. These are pre-requisites for the data to be useful in a practical sense: for example, it's not sufficient to provide the data via a website that only returns subsets of that data via a form submission.
(See the Open Knowledge Foundation's official definition of openness for the full criteria.)
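To illustrate why technical openness matters in practice: once a dataset is available in bulk in a machine-readable format such as CSV, a few lines of standard-library code can aggregate it, whereas data locked behind a form-submission interface cannot be processed this way at all. The spending figures below are invented purely for illustration and bear no relation to any real dataset.

```python
import csv
import io

# Hypothetical extract of a spending dataset, in the kind of bulk,
# machine-readable CSV format that "technical openness" calls for.
raw = """department,amount
Defence,1200
Environment,300
Culture,150
Defence,800
"""

# Sum spending by department - trivial precisely because the data
# arrives as structured records rather than rendered web pages.
totals = {}
for row in csv.DictReader(io.StringIO(raw)):
    totals[row["department"]] = totals.get(row["department"], 0) + int(row["amount"])

print(totals)
```

The same aggregation against a site that only returns subsets of the data via a form would require scraping, which is exactly the kind of effort (the 90% mentioned above) that open formats are meant to eliminate.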

The data itself can be about almost anything: geographical (for example, mapping postcodes to a latitude and longitude), statistical, electoral, legal, financial - the OKFN's CKAN (Comprehensive Knowledge Archive Network) site has many examples. The key point is that the data should not be personal - that is, it shouldn't enable individuals to be identified, either directly or indirectly.

The motivation for making data open goes back to the initial issues of completeness, correctness and ease-of-use - it can take a lot of time to assemble a dataset (for example, the Government already collects a lot of data), but once that effort has been made the added cost of releasing the data is small, and sharing it reduces the cost of merging, filling gaps and correcting errors. To make an analogy with open source software, it's essentially Linus's Law for data: "given enough eyeballs, all bugs are shallow". Rufus also talked about a corollary to this, the "many minds" principle: the best use of the data you produce will probably be thought of by someone else (and vice versa).

One argument against openness is that it precludes the possibility of commercial exploitation to offset the costs of compiling the data - a topical point given the current economic climate. Rufus's counter-argument is that there are many other ways to fund the creation of data aside from making it proprietary: by considering the data as a platform (rather than as a product), and building on that platform to sell extensions or complementary services such as consultancy (again, there are parallels with open source software). Some of the audience also expressed concerns that, in principle at least, open data might be used irresponsibly - but arguably, if the data is available to all, then others can challenge any irresponsible interpretation of it.

The final point that Rufus's talk addressed is how to actually build the open data ecosystem. To some degree it's up to the people who hold the data, but his suggestions are:
  • Start small and simple (which I took to mean, start with small sets of data rather than doing everything all at once).
  • If you're using someone else's dataset then you can make an enquiry via the OKFN website to find out what the licensing situation is.
  • If you have your own datasets then put them under an open data licence and register them at CKAN so that others can find them.
  • "Componentize" your data to make it easier to reuse (which I took to mean, divide the datasets up into sensible subsets).
  • Make the case with whoever holds the data you want (government, business etc) to release it openly.
For me as a "lay person", this was a fascinating introduction to the world of open data. Not unreasonably, the seminar didn't go into the details of actually working with such data (I think many of the audience members were researchers already familiar with the available tools). However, afterwards Rufus made the point that writing a paragraph of text after looking at the data is just as valid as the slick visualisations provided by Where Does My Money Go? and other sites. Ultimately it's having open access to the data in the first place that counts.
