Opinion: The case for metadata

People collecting data. Photo by: Todd Huffman / CC BY

The introduction of the Sustainable Development Goals has brought with it calls for a data revolution: transformative action to respond to the data demands of the post-2015 development agenda.

Much of the focus of the data revolution has been on data collection, capacity building, closing data gaps and using data to promote transparency and accountability. While making improvements in each of these areas is absolutely necessary, there is a pressing need to take the revolution further by going beyond the data itself. We need to see the data revolution extend to cover a revolution in the information that sits behind data — the what, the how, the who and the when — because this revolution would make planning and monitoring spending on the SDGs a far more manageable task.

Take the example of SDG 4 — quality education — in Kenya. To understand what resources are being targeted at meeting this goal, we need to analyze and compare data from a number of sources. This would include the Kenyan government’s current budget data and data on its future spending plans. We would also want to look at donor data reported through international systems, and perhaps the indicators produced by the national statistics office. But before we get started on comparing corresponding datasets we need to ask a series of questions: When was this dataset published? Who published it? How was the data gathered? What does this dataset contain?

In essence, we’re really asking: Can we trust this data?

In some cases the answers to these questions are out there, but too often they take patience, perseverance and manual intervention to find. Imagine then how much easier it would be if each dataset were equipped with this information and more. And imagine how much we could learn if that information were hyperlinked to show us where to find other related datasets.

The information we’re talking about here is metadata. Like the information that’s held about a publication in a library catalogue — information about the title of a book, the author, the date of publication, the ISBN number — metadata associated with datasets tells us critical information about the provenance of the data we’re looking at. In doing so, metadata helps us to establish whether or not we can trust that data. And this question of trust is particularly important as more and more data is published openly. We can access data about so many topics, sectors and people, but do we know where it originated, how current it is or when it will next be updated?

Very often the answer to these questions is no, because the metadata that accompanies an open dataset is either patchy or missing altogether. Data publishers are not yet consistently publishing the metadata that is required to establish the origins of the data they’re sharing.

Beyond the question of trust, this lack of metadata becomes problematic when we are trying to make sense of a dataset, compare it with another or discover other related datasets. How can we track progress over time, for example, or understand how data about one location compares to data about the same location from a different source, if we’re not equipped with basic information about the data we’re looking at?

These are vital comparisons if we are to see evidence-led decisions being made about how best to meet the SDGs, or to track our progress toward meeting them, especially when the datasets involved come from various sources.

Importantly, metadata is not only useful for describing datasets and establishing trust. Metadata can also help build links between datasets, to highlight where a newly published version of the data is held, or to direct us to historical data that would help to build a baseline. The benefit of including pointers to other related datasets within metadata would be the generation of a network of linked datasets, which in turn would enable searches of multiple datasets through a single query. This would be invaluable in performing analyses such as the Kenya example above.
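
To make the idea of linked metadata concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the field names, the dataset identifiers, the publishers and the dates are invented and do not come from any real catalogue or metadata standard. It shows how records that answer the what, who and when, and that carry pointers to related datasets, could let a single query start at one dataset and reach the others.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    identifier: str
    title: str          # the "what"
    publisher: str      # the "who"
    issued: str         # the "when", as an ISO 8601 date
    keywords: set[str] = field(default_factory=set)
    related: list[str] = field(default_factory=list)  # identifiers of linked datasets


def find_linked(catalog: dict[str, DatasetMetadata], start: str, keyword: str) -> list[str]:
    """Follow 'related' links breadth-first from `start` and return the
    identifiers of every reachable dataset tagged with `keyword`."""
    seen = {start}
    queue = deque([start])
    matches = []
    while queue:
        current = catalog[queue.popleft()]
        if keyword in current.keywords:
            matches.append(current.identifier)
        for linked_id in current.related:
            if linked_id in catalog and linked_id not in seen:
                seen.add(linked_id)
                queue.append(linked_id)
    return matches


# Invented example records, loosely modelled on the education scenario above.
records = [
    DatasetMetadata("budget-2016", "Education budget (illustrative)", "Example treasury",
                    "2016-06-01", {"education", "budget"}, ["donor-aid", "budget-2015"]),
    DatasetMetadata("budget-2015", "Previous education budget (illustrative)", "Example treasury",
                    "2015-06-01", {"education", "budget"}),
    DatasetMetadata("donor-aid", "Donor-reported aid (illustrative)", "Example donor registry",
                    "2016-03-15", {"education", "aid"}, ["stats-indicators"]),
    DatasetMetadata("stats-indicators", "Education indicators (illustrative)", "Example statistics office",
                    "2016-01-10", {"education", "indicators"}),
    DatasetMetadata("health-spend", "Health spending (illustrative)", "Example treasury",
                    "2016-06-01", {"health"}),
]
catalog = {record.identifier: record for record in records}

print(find_linked(catalog, "budget-2016", "education"))
# ['budget-2016', 'donor-aid', 'budget-2015', 'stats-indicators']
```

The health dataset is never returned because no record links to it, which is the point: a dataset published without pointers to and from its neighbours stays invisible to this kind of query.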

With the benefits of publishing metadata being so clear, it is crucial that within the data revolution a metadata revolution also takes place. As a minimum, data producers should be attaching metadata to all the data that they publish, and the W3C’s Data on the Web Best Practices provides a good starting point. To derive the maximum benefit from publishing this metadata though, data producers should go a step further: They should publish metadata to the same standard or, at least, should seek interoperable solutions to ensure that the data they are publishing becomes easily discoverable to any potential data users across platforms.

The technology to do all this already exists; metadata standards have been around for a while. So it’s now a case of will — the will of data producers to publish metadata to make sure that their data can be found, trusted and used in evidence-led decision-making.
