Seven deadly sins of data publication
Neil Ashton - October 17, 2013 in Data for CSOs
The advantages to non-governmental organizations of digitizing data are obvious. Digital data cannot, after all, be destroyed in a fire, unlike the 31,800,000 pages of irreplaceable Maharashtra government records that burned in 2012.
But NGOs should take the further step of publishing their digital data. Publishing data improves not only an organization’s credibility but also its internal circulation of data. When data is made accessible to others, it cannot help but become more accessible to its creators in the process.
There are many ways to prepare data for publication. Many of these ways are just plain wrong: they defeat the purpose of releasing data. Follow the righteous path and avoid these seven common errors when preparing your data for release.
1. Using PDFs
The popular PDF file format is a great way to distribute print documents digitally. It is also worthless for distributing data. Using data stored in a PDF is only slightly easier than retyping it by hand.
Avoid distributing data in PDFs or other display-oriented formats like Word documents, rich text, HTML, or (worst of all) bitmap images. Instead of publishing data tables in PDFs, use a machine-readable and open tabular data format like CSV.
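As a minimal sketch of what publishing a table as CSV can look like, the snippet below uses Python's standard csv module. The records, column names, and the to_csv helper are invented for illustration and do not come from any real dataset.

```python
import csv
import io

# Hypothetical records; the names and figures are made up for this example.
rows = [
    {"district": "North", "year": 2012, "clinics": 14},
    {"district": "South", "year": 2012, "clinics": 9},
]


def to_csv(records, fieldnames):
    """Serialize a list of dicts as CSV text with a single header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows, ["district", "year", "clinics"])
print(csv_text)
```

The same text can be written to a file and opened by any spreadsheet program or data tool, which is the point of choosing a plain tabular format.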
2. Web interfaces
Reuse is the goal of data publication, and raw data is the easiest to reuse. Every technological trick standing between the data and the user is an obstacle. Fancy web interfaces constructed with Flash are the worst such obstacles.
A Flash web application is a reasonable choice if the goal is an interactive presentation of an interpretation of some data. But such an application is just an interpretation, and it keeps the data hidden from the user. Users may still be able to retrieve the data, but they will effectively have to hack the software to do so! Make it easy for them: consistently provide links to data.
3. Malformed tables
Spreadsheet software makes it possible to decorate data with formatting that facilitates reading, such as sub-table headings and inline figures. These features are bad for data distribution. Data users will have to spend time stripping them away. Save time by not including them in the first place.
The ideal form of published tabular data is a simple “rectangular” table. Such a table has a one-to-one correspondence between data points and rows and has the same number of columns for each row, with every row having a value for every column. Missing values should be indicated with a special value rather than left blank. Sub-tables with different columns should either be broken into separate files or, if really necessary, aggregated into a single table by combining multiple tables’ columns. The result is a table with no “special” rows and a single set of columns.
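A rough way to test a table for this “rectangular” shape before release is sketched below. The find_table_problems function and the sample data are hypothetical; the check simply flags rows whose column count differs from the header and cells that were left blank instead of carrying an explicit missing-value marker (here assumed to be "NA").

```python
import csv
import io


def find_table_problems(csv_text):
    """Return (row_number, message) pairs for rows that break rectangularity.

    The header counts as row 1. A blank cell is reported because missing
    values should carry an explicit marker rather than be left empty.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    problems = []
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                (row_number, f"expected {len(header)} columns, found {len(row)}")
            )
            continue
        for name, value in zip(header, row):
            if value.strip() == "":
                problems.append((row_number, f"blank value in column '{name}'"))
    return problems


# Made-up sample: row 3 is short, row 4 has a blank cell, row 5 uses "NA".
sample = (
    "district,year,clinics\n"
    "North,2012,14\n"
    "South,2012\n"
    "East,2012,\n"
    "West,2012,NA\n"
)
problems = find_table_problems(sample)
for row_number, message in problems:
    print(f"row {row_number}: {message}")
```

An empty result means every row has the same number of columns as the header and no cell was left blank.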
4. No metadata
You may think that “raw data” does not come with a ready-made interpretation. Not so. There should always be an intended interpretation of the units of measurement, the notation for missing values, and so on. If no indication of this basic interpretation is provided, the user has to guess. Include metadata which saves them the trouble.
Standards like the Data Package (for general data) or the Simple Data Format (for CSV files) allow you to include metadata with data as a simple JSON file. The metadata should include at least the units of measurement for quantitative values, the meaning of qualitative values, the format for dates, and the notation for a missing value.
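To make this concrete, here is a small sketch that writes such a metadata file with Python's json module. The key names (missing_value, date_format, fields, and so on) are illustrative assumptions and are not taken from the Data Package or Simple Data Format specifications; consult those standards for the exact layout they require.

```python
import json

# Illustrative metadata for a hypothetical CSV file. The key names are
# assumptions made for this sketch, not the official specification.
metadata = {
    "name": "clinic-counts",
    "description": "Number of clinics per district and year (example data).",
    "missing_value": "NA",
    "date_format": "YYYY-MM-DD",
    "fields": [
        {"name": "district", "type": "string",
         "description": "District name as used in the organization's records"},
        {"name": "surveyed_on", "type": "date",
         "description": "Date the count was taken"},
        {"name": "clinics", "type": "integer", "unit": "count",
         "description": "Clinics open on the survey date"},
    ],
}

metadata_json = json.dumps(metadata, indent=2)
print(metadata_json)
```

Each column gets a description, quantitative columns get a unit, and the date format and missing-value notation are stated once for the whole file, so the user does not have to guess any of them.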
5. Inconsistency within datasets
Inconsistencies are more common than actual errors. Inconsistencies include mayhem like haphazard units of measurement and multiple names given to the same entities. These problems are so widespread that “data cleaning”, which mostly means eliminating inconsistencies, is the first step in all data wrangling projects. Help make data cleaning a thing of the past by carefully checking your data for consistency before releasing it.
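One simple consistency check is to look for entities that appear under several spellings. The sketch below groups names that become identical after lowercasing and removing punctuation and extra spaces. The function names and the sample organization list are chosen for illustration, and real data will usually need more than this heuristic (abbreviations and translations, for instance, will not be caught).

```python
from collections import defaultdict


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(kept.split())


def find_name_variants(names):
    """Map each normalized name to its spellings, keeping only names
    that occur with more than one distinct spelling."""
    groups = defaultdict(set)
    for name in names:
        groups[normalize(name)].add(name)
    return {
        key: sorted(spellings)
        for key, spellings in groups.items()
        if len(spellings) > 1
    }


# Made-up column of organization names with inconsistent spellings.
names = ["Red Cross", "red cross", "Red  Cross.", "Oxfam", "OXFAM", "CARE"]
variants = find_name_variants(names)
for key, spellings in variants.items():
    print(key, "->", spellings)
```

Every group reported is a candidate for replacement with a single canonical spelling before the data is released.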
6. Inconsistency across datasets
Publishing data is a commitment that should not be made lightly. Once a data format and a publication venue have been chosen, make an effort to stick to them for all future data releases of the same type.
Data is most useful when different datasets can be combined to test wide-ranging hypotheses. Not maintaining a single standard for data of a single type turns otherwise comparable data into a disconnected mess which requires considerable effort to put together. Make data freely remixable: adopt a consistent standard for data and metadata across as many datasets as possible.
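A quick check before each new release is to compare its column headers against an earlier release of the same type. The following sketch does this for two CSV texts; the compare_headers function and the sample releases are hypothetical.

```python
import csv
import io


def read_header(csv_text):
    """Return the first row of a CSV text as a list of column names."""
    return next(csv.reader(io.StringIO(csv_text)))


def compare_headers(reference_text, candidate_text):
    """Report how the candidate's columns differ from the reference's."""
    reference = read_header(reference_text)
    candidate = read_header(candidate_text)
    return {
        "missing": [col for col in reference if col not in candidate],
        "unexpected": [col for col in candidate if col not in reference],
        "same_order": reference == candidate,
    }


# Made-up releases: the newer one renames a column and adds another.
release_2012 = "district,year,clinics\nNorth,2012,14\n"
release_2013 = "District,year,clinics,notes\nNorth,2013,15,\n"

report = compare_headers(release_2012, release_2013)
print(report)
```

Here even a change of capitalization ("district" versus "District") shows up as a mismatch, which is exactly the kind of drift that makes datasets hard to combine later.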
7. Bad licensing
Who can use your organization’s data? The license under which data is released is a major part of the answer to this question. There is very little point in releasing data at all under a restrictive license, and if the licensing is left unspecified, the data will exist in a state of legal limbo.
Consider making your data available under a permissive “open” license like the Open Data Commons Open Database License. Once you choose a suitable license for your data, indicate this license in the data’s metadata.
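A release script can refuse to publish when the metadata names no license. The sketch below assumes the metadata is a plain dictionary with a "license" key; that key name and the check_license helper are assumptions made for illustration, not part of any metadata standard.

```python
def check_license(metadata):
    """Return the declared license name, or raise ValueError if none is given."""
    license_name = str(metadata.get("license") or "").strip()
    if not license_name:
        raise ValueError("metadata does not specify a license")
    return license_name


# Hypothetical metadata for a dataset about to be released.
metadata = {
    "name": "clinic-counts",
    "license": "Open Data Commons Open Database License",
}

declared = check_license(metadata)
print("Releasing under:", declared)
```

Running a check like this as the last step before publication keeps data from slipping out with its licensing left unspecified.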