Seven Ways to Create a Storymap

- August 25, 2014 in Data Journalism, Data Stories, HowTo, Storytelling

If you have a story that unfolds across several places over a period of time, storymaps can provide an engaging interactive medium with which to tell the story. This post reviews some examples of how interactive map legends can be used to annotate a story, and then rounds up seven tools that provide a great way to get started creating your own storymaps.

Interactive Map Legends

The New York Times interactives team regularly comes up with beautiful ways to support digital storytelling. The following three examples all make use of floating interactive map legends to show the reader the current location of a journey-based story as it unfolds.

Riding the New Silk Road, from July 2013, is a pictorial review featuring photographs captured on a railway journey that follows the route of the Silk Road. The route is traced out on a map on the left-hand side as you scroll down through the photos, and each image is linked to a placemark on the route to show where it was taken.

Riding the New Silk Road (NYTimes.com interactive feature)

The Russia Left Behind tells the story of a 12-hour drive from St. Petersburg to Moscow. The piece is primarily a textual narrative, with rich photography and video clips to illustrate it, and an animated map legend traces out the route as you read through the story of the journey. Once again, the animated journey line gives you a sense of moving through the landscape as you scroll through the story.

The Russia Left Behind (NYTimes.com)

A Rogue State Along Two Rivers, from July 2014, describes the progress made by Isis forces along the Tigris and Euphrates rivers using two maps. Each plots the course of one of the rivers and uses place-linked words and photos to tell the story of the Isis manoeuvres along each waterway. An interactive map legend shows where along the river the current map view sits, providing a wider geographical context for the local view shown by the more detailed map.

A Rogue State Along Two Rivers (NYTimes.com)

All three of these approaches give the reader a sense of motion through the journey that brought the narrator to the places described or alluded to in the written text. The unfolding of the map conveys the sense that a journey must be made to get from one location to another, and the map view – and the map scale – help the reader grasp that journey both in terms of the physical, geographical distance it covers and, by implication, the time that must have been spent making it.

A Cartographic Narrative

Slave Revolt in Jamaica, 1760-1761, a cartographic narrative – a collaboration between Axis Maps and Harvard University’s Vincent Brown – describes itself as an animated thematic map that narrates the spatial history of the greatest slave insurrection in the eighteenth-century British Empire. When played automatically, a sequence of timeline-associated maps is stepped through, each one separately animated to illustrate the supporting text for that particular map view. The source code is available here.

Jamaican Slave Revolt (cartographic narrative)

This form of narrative is in many ways akin to a free running, or user-stepped, animated presentation. As a visual form, it also resembles the pre-produced linear cut scenes that are used to set the scene or drive the narrative in an interactive computer game.

Creating your own storymaps

The New York Times storymaps use animated map legends to give the reader the sense of going on a journey by tracing out the route being taken as the story unfolds. The third example, A Rogue State Along Two Rivers, also makes use of a satellite map as the background to the story, which at its heart is nothing more than a set of image markers placed onto an interactive map that has been oriented and constrained so that you can only scroll down. Even though the map scrolls down the page, the inset legend shows that the route being taken may not be a north-south one at all.

The linear, downward scroll mechanic helps the reader feel as if they are reading down through a story – control is very definitely in the hands of the author. This is perhaps one of the defining features of the storymap idea – the author is in control of unravelling the story in a linear way, although the location of the story may change. The use of the map helps orient the reader as to where the scenes – and particularly any imagery – in the current part of the story are located.

Recently, several tools and Javascript code libraries have been made available from a variety of sources that make it easy to create your own story maps within which you can tell a geographically evolving story using linked images, or text, or both.

Knight Lab StoryMap JS

The Knight Lab StoryMap JS tool provides a simple editor, synched to a Google Drive account, that allows you to create a storymap as a sequence of presentation slides, each of which describes a map location, some header text, some explanatory text and an optional media asset such as an image or embedded video. Clicking between slides animates the map from one location to the next, drawing a line between consecutive points to make the step between them explicit. The story is described using a custom JSON data format saved to the linked Google Drive account.
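
As a rough illustration of the slide-per-location idea described above – not the exact StoryMapJS schema, so treat the field names below as assumptions – the underlying JSON for a two-stop story might be sketched in Python like this:

import json

# Hypothetical slide structure: one entry per location, with header text,
# body text and an optional media asset, mirroring the editor fields above.
slides = [
    {
        "location": {"lat": 51.5074, "lon": -0.1278},
        "text": {"headline": "Leaving London", "text": "The journey begins..."},
        "media": {"url": "https://example.com/photo1.jpg"},
    },
    {
        "location": {"lat": 48.8566, "lon": 2.3522},
        "text": {"headline": "Paris", "text": "First stop on the route."},
    },
]

# Serialise in the spirit of the tool's custom JSON format, ready to be
# saved to the linked Google Drive account.
print(json.dumps({"storymap": {"slides": slides}}, indent=2))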

Knight Lab StoryMap JS editor

[StoryMapJS code on Github]

CartoDB Odyssey.js

Odyssey.js provides a templated editing environment that supports the creation of three types of storymap: a slide-based view, where each slide displays a location, explanatory text (written using markdown) and optional media assets; a scroll-based view, where the user scrolls down through a story and different sections of the story trigger the display of a particular location in a map view fixed at the top of the screen; and a torque view, which supports the display and playback of animated data views over a fixed map view.

Odyssey.js Sandbox

A simple editor – the Odyssey sandbox – allows you to script the storymap using a combination of markdown and map commands. Storymaps can be published by saving them to a central github repository, or downloaded as an HTML file that defines the storymap, bundled within a zip file that contains any other necessary CSS and Javascript files.

[Odyssey.js code on Github]

Open Knowledge TimeMapper

TimeMapper is an Open Knowledge Labs project that allows you to describe location points, dates, and descriptive text in a Google spreadsheet and then render the data using linked map and timeline widgets.

TimeMapper editor (Open Knowledge Foundation Labs)

[Timemapper code on Github]

JourneyMap (featuring waypoints.js)

JourneyMap is a simple demonstration by Keir Clarke that shows how to use the waypoints.js Javascript library to produce a simple web page containing a scrollable text area that can be used to trigger the display of markers (that is, waypoints) on a map.

Journey Map demo

[waypoints.js on Github; JourneyMap src]

Google Earth TourBuilder

Google Earth TourBuilder is a tool for building interactive 3D Google Earth Tours using a Google Earth browser plugin. Tours are saved (as KML files?) to a Google Drive account.

Tour Builder

[Note: Google Earth browser plugin required.]

ESRI/ArcGIS Story Maps

ESRI/ArcGIS Story Maps are created using an online ArcGIS account and come in three types, each with a range of flavours:

  • “Sequential, place-based narratives” (map tours): either an image carousel (map tour) that lets you step through a sequence of images, each displayed alongside a map showing the corresponding location, or a scrollable text (map journal) with linked location markers (the text can also trigger the display of half-page images rather than maps).
  • Curated points-of-interest lists: a palette of images, each of which can be associated with a map marker and detailed pop-up information (shortlist); a numerically sequenced list that displays map locations and large associated images (countdown list); and a playlist that lets you select items from a list and display pop-up infoboxes associated with map markers.
  • Map comparisons: simple tabbed views that let you describe separate maps, each with its own sidebar description, across a series of tabs; an accordion view, with separate map views and descriptions contained within each section; and swipe maps that let you put one map on top of another and move a sliding window bar across them to reveal either the top or the lower layer. A variant of the swipe map – the spyglass view – displays one layer but lets you use a movable “spyglass” to look at corresponding areas of the other layer.

App List – Story Maps

[Code on github: map-tour (carousel) and map journal; shortlist (image palette), countdown (numbered list), playlist; tabbed views, accordion map and swipe maps]

Leaflet.js Playback

Leaflet.js Playback is a leaflet.js plugin that allows you to play back a time-stamped GeoJSON file, such as a GPS log file.
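
As a sketch of the kind of input such a plugin works with, the following Python snippet builds a time-stamped GeoJSON track from a list of GPS fixes. The exact property layout the plugin expects (for example a time array of millisecond timestamps alongside the coordinates) is an assumption here, so check the plugin's own examples before relying on it:

import json

# Hypothetical GPS log: (longitude, latitude, unix timestamp in seconds).
fixes = [
    (-0.1278, 51.5074, 1404902100),
    (-0.1300, 51.5100, 1404902160),
    (-0.1350, 51.5150, 1404902220),
]

# A GeoJSON LineString with a parallel array of timestamps (milliseconds)
# stored in the feature properties; the property names are illustrative.
track = {
    "type": "Feature",
    "geometry": {
        "type": "LineString",
        "coordinates": [[lon, lat] for lon, lat, _ in fixes],
    },
    "properties": {"time": [ts * 1000 for _, _, ts in fixes]},
}

with open("track.geojson", "w") as f:
    json.dump(track, f, indent=2)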

Leaflet Playback demo

[Code on Github]

Summary

The above examples describe a wide range of geographical and geotemporal storytelling models, often based around quite simple data files containing information about individual events. Many of the tools make strong use of image files as part of the display.

It may be interesting to complete a more detailed review that describes the exact data model used by each of these techniques, with a view to identifying a generic data model that could serve all of them, or be transformed into the distinct data representations supported by each of the separate tools.

UPDATE 29/8/14: via the Google Maps Mania blog, some examples of storymaps made with MapboxGL embedded within longer-form texts: detailed Satellite views, and from the Guardian: The impact of migrants on Falfurrias [scroll down]. Keir Clarke also put together this demo: London Olympic Park.

UPDATE 31/8/14: via @ThomasG77, OpenStreetMap’s uMap tool (about uMap) for creating map layers, which includes a slideshow mode that can be used to create simple storymaps. uMap also provides a way of adding a layer to a map from a KML or GeoJSON file hosted elsewhere on the web (example).


Festing with School of Data

- July 8, 2014 in Community, Data Expeditions, Data Stories, Events, School_Of_Data

School of Data Fellows, Partners, Friends, staff and supporters will converge on Berlin next week for OKFestival: July 15 – 17, 2014. We know that many of you may be attending the festivities and we’d love to connect.

Mingling: Science is Awesome!

Tuesday, July 15, 2014 18:00 CET
OKFestival starts with a Science Fair to help you get a taste of all the amazing people and activities. We’ll be there to share School of Data with the large global community. Please stop by and say hi!

Activity: Be A Storyteller

July 15 – 17, 2014
As those of you who have attended Data Expeditions before will know, being able to tell an impactful story is key to success. Join the Storytelling team as we meander through the festival collecting and sharing real-time stories. To join.

Session: How to Teach Open Data

Thursday, July 17th, 2014 15:30 – 16:30 CET
Are you passionate about teaching data and tech? Are you striving to support a community of data teachers and learners? Are you keen to exchange experiences with other professionals in the field of teaching data? Then this is the right session for you.
Join us for a conversation about standards and methodologies for data teaching with School of Data, Peer to Peer University and Open Tech School.

  • How to organise tech and data workshops
  • Building effective curriculum and accreditation
  • Types of education activities: blended offline and online
  • Designing passion driven communities

More about the session.

Informal Session: How to Build a School of Data

Thursday, July 17, 2014 16:30 – 17:15 CET (same room as the previous session.)
Are you keen to join School of Data? Do you want to set up a School of Data instance in your locale? Join us to meet staff, fellows and partners. We’ll answer your questions and start the conversations.

Most of all – happy Festing!

(Note: For those of you who are unable to attend OKFestival, we’ll be sure to share more details post-event. See you online.)


Why should we care about comparability in corruption data?

- May 29, 2014 in Data Expeditions, Data Stories

Does comparing diverse data-driven campaigns empower advocacy? How can comparing data on corruption across countries, regions and contexts contribute to efforts on the ground? Can the global fight against corruption benefit from comparable datasets? The engine room tried to find some answers through two back-to-back workshops in London last April, in collaboration with our friends from School of Data and CIVICUS.

The first day was dedicated to a data expedition, where participants explored a collection of specific corruption-related datasets. This included a wide range of data, from international perception-based datasets such as Transparency International’s Global Corruption Barometer, through national Corruption Youth Surveys (Hungary), to citizen-generated bribe reports like I Paid A Bribe Kenya.

Hard at work organizing corruption datatypes.

The second day built on lessons learned in the data expedition. Technologists, data literates and harmonization experts convened for a day of brainstorming and toolbuilding. The group developed strategies and imagined heuristics through an analysis of existing cases, best practices and personal experience.

Here is what we learned:

Data comparability is hard

Perhaps the most important lesson from the data expedition was that one single day of wrangling can’t even begin to grasp the immensely diverse mix of corruption data out there. When looking at scope, there was no straightforward way to find links between locally sourced data and the large-scale corruption indices. Linguistic and semantic challenges to comparing perceptions across countries were an area of concern. Since datasets were so diverse, groups spent a considerable amount of time familiarizing themselves with the available data, as well as hunting for additional datasets. Lack of specific incident-reporting datasets was also noticeable. In the available datasets, corruption data usually meant corruption perception data: data coming from surveys gauging people’s feelings about the state of corruption in their community. Datasets containing actual incidents of corruption (bribes, preferred sellers, etc) were less readily available. Perception data is crucial for taking society’s pulse, but is difficult to compare meaningfully across different contexts — especially considering the fluidity of perception in response to cultural and social customs — and very complex to cross-correlate with incident reporting.

Finding patterns in chaos

Pattern-finding expedition

An important discussion also came to life regarding the lack of technical capacity among grassroots organizations that collect data, and how that negatively impacts the data quality. For organizations on the ground it’s a question of priorities and capacity. Organisations that operate in dangerous areas, responding to urgent needs with limited resources, don’t necessarily consider data collection proficiency a top-shelf item. In addition, common methods and standards in data collection empower global campaigns for remote actors (cross-national statistics, high-level policy projects etc) but don’t necessarily benefit the organizations on the ground collecting the data. These high-level projects may or may not have trickle-down benefits. Grassroots organizations don’t have a reason to adopt standardized data collection practices, unless it helps them in their day-to-day work: for example providing tools that are easier to use, or having the ability to share information with partner organizations.

Data comparability is possible

While the previous section might paint a bleak picture, the reality is more positive, and the previous paragraph tells us where to look (or, how to look). The amorphous blob of all corruption-related data is too generically daunting to make sense of — until we flip the process on its head. As in the best detective novels, starting small and investigating specific local stories of corruption lets investigators find a thread and follow it along, slowly unraveling the complex yarn of corruption towards the bigger picture. So, for example, a small village in Azerbaijan complaining about the “Ingilis” that contaminate their water can unravel a story of corruption leading all the way to the presidential family. This excellent example, and many more, come from Paul Radu’s investigative experience, described in the Exposing the Invisible project produced by the Tactical Technology Collective.

Screengrab from "Our Currency is Information" by Tactical Technology Collective

Screengrab from “Our Currency is Information” by Tactical Technology Collective

There are also excellent resources that collect and share data in comparable, standardized and functional ways. Open Corporates, for example, collects information on more than 60 million corporations, and provides beautiful, machine-readable, API-pluggable information, ready to be perused by humans and computers, and easily comparable and mashable. If your project involves digging through corporation ownership, Open Corporates will most surely be able to help you out. Another project of note is the Investigative Dashboard that collects scraped business records from numerous countries, as well as hundreds of reference databases.

What happens when datasets just aren’t compatible, and there is no easy way to convince the data producers to make them more user-friendly? Many participants voiced their trust in civic hackers and the power of scraping — even if datasets aren’t provided in machine-readable formats, or standardized and comparable, there are many tools (as well as many helpful people) that can come to the rescue. The best source for finding both? Well, the School of Data, of course. Apart from providing a host of useful tutorials and links, it acts as a hub for engaged civic hackers, data wranglers and storytellers all over the world.

Citizen engagement is key

During a brainstorm where participants compared real-life models of data mashups (surveys, incident reporting, budget data), it became clear that many corruption investigation projects involve crowdsourced verification. While crowdsourcing is a vague concept in itself, it can be very powerful when focused within a specific use case. It’s important for anti-corruption projects that revolve around leaked data (such as the Yanukovych leaks), or FOIA requests that yield information in difficult-to-parse formats that aren’t machine readable (badly scanned documents, or even boxes of paper prints). In cases like these, citizen engagement is possible because there are clear incentives for individuals to get involved. Localized segmentation (where citizens look only at data directly involving them or their communities) is a boon for disentangling large lumps of data, as long as the information interests enough people to engage a groundswell of activity. Verification of official information can also help, for example when investigating whether state-financed infrastructures are actually being built, or if there is just a very expensive empty lot where a school is supposed to be.

It makes perfect sense, then, to look at standardization and comparability as an enabling force for citizen engagement. The ability to mash and compare different datasets brings perspective, and enables the citizens themselves to have a clearer picture, and act upon that information to hold their institutions accountable. However, translating, parsing and digesting spaghetti-data can be so time-consuming and cumbersome that organisations might just decide it’s not worth the effort. At the same time, data-collecting organizations on the ground, presented with unwieldy, overly complex standards, will simply avoid using them and compound the comparability problem. The complexity in the landscape of corruption data represents a challenge that needs to be overcome, so that data being collected can truly inspire citizen action for change.


Learning to Listen to your Data

- March 27, 2014 in Data Stories

School of Data mentor Marco Túlio Pires has been writing for our friends at Tactical Tech about journalistic data investigation. This post “talks us through how to begin approaching and thinking about stories in data”, and it was originally published on Exposing the Invisible‘s resources page.

Journalists used to carefully establish relationships with sources in the hope of getting a scoop or obtaining a juicy secret. While we still do that, we now have a new source which we interrogate for information: data. Datasets have become much like those real sources – someone (or something!) that holds the key to many secrets. And as we begin to treat datasets as sources, as if they were someone we’d like to interview, to ask meaningful and difficult questions to, they start to reveal their stories, and more often than not, we come across tales we weren’t even looking for.

But how do we do it? How can we find stories buried underneath a pile of raw data? That’s what this post will try to show you: the process of understanding your data and listening to what your “interviewee” is trying to tell you. And instead of giving you a lecture about the ins and outs of data analysis, we’ll walk you through an example.

Let’s take an example from The Guardian, the British newspaper that has a very active data-driven operation. We’re going to (try to) “reverse engineer” one of their stories, in the hope that you get a glimpse of what happens when you go after information that you have to compile, clean and analyse, and of the kind of decisions we make along the way to tell a story out of a dataset.

So, let’s talk about immigration. Every year, the Department of Immigration and Border Protection of Australia publishes a bunch of documents about immigration statistics down under. The team at The Guardian focused on a report called Asylum Trends for 2011-2012, published last year. There’s a more up-to-date version available (2012-2013). By the end of this exercise, we hope you can use the newer version to compare it with the dataset used by The Guardian. Let us know in the comments about your findings.

The article starts with a broad question: does Australia have a problem with refugees? That’s the underlying question that helps make this story relevant. It’s useful to start a data-driven investigation with a question, something that bothers you, something that doesn’t seem quite right, something that might be an issue for a lot of people.

With that question in mind, I quickly found a table on page 2 with the total number of people seeking protection in Australia.

People seeking Australia's protection

Let’s make a chart out of this and see what the trend is. Because this is a pesky PDF file, you’ll need to either type the data by hand into your spreadsheet processor or use an app to do that for you. For a walkthrough of a tool that does this automatically, see the Tabula example here.

After putting the PDF into Tabula this is what we get (data was imported into OpenOffice Calc):

Tabula

I opened the CSV file in OpenOffice Calc and edited it a bit to make it clearer. Let’s see how the number of people seeking Australia’s protection has changed over the years. Using the Chart feature in the spreadsheet, we can compare columns A and D by making a line chart.

Line chart

Take a good look at this chart. What’s happening here? On the vertical axis, we see the total number of people asking for Australia’s protection. On the horizontal axis, we see the timeline year by year. Between 2003 and 2008, there’s no significant change. But something happened from 2009 on. By the end of the series, it’s almost three times higher. Why? We don’t know yet. Let’s take a look at other data from the PDF and use Tabula to import it to our spreadsheet. Maybe that will show us what’s going on.

Australia divides their refugees into two groups: those who arrived by boat and those who arrived by air. They use the acronyms IMA and non-IMA (IMA stands for Irregular Maritime Arrivals). Let’s compare the totals of the two groups and see how they relate across the years presented in this report. Using Table 4 and Table 25, we’ll create a new table that has the totals for the two groups. Be careful, though: the non-IMA table goes back to 2007, but the IMA table only goes back to 2008. Let’s create a line chart with this data.
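
If you would rather script this step than build the comparison table by hand in a spreadsheet, a rough pandas sketch might look like the following. The file and column names are assumptions about how you saved the Tabula output, not something prescribed by the report:

import pandas as pd
import matplotlib.pyplot as plt

# CSVs assumed to be exported from Tabula, one row per year with a "Total"
# column; adjust the file and column names to match your own export.
ima = pd.read_csv("ima_totals.csv", index_col="Year")          # boat arrivals
non_ima = pd.read_csv("non_ima_totals.csv", index_col="Year")  # air arrivals

# Align the two series on the years they share (the tables start in
# different years), then plot them side by side.
totals = pd.DataFrame({
    "IMA (boat)": ima["Total"],
    "non-IMA (air)": non_ima["Total"],
}).dropna()

totals.plot(kind="line", marker="o", title="Asylum seekers by mode of arrival")
plt.ylabel("People")
plt.savefig("ima_vs_non_ima.png")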

Line chart comparing IMA and non-IMA totals by year

What’s that? It seems that in 2011-2012, for the first time in this time series, the number of refugees arriving in Australia by boat surpassed those landing by plane. The next question could be: where are all the IMA refugees coming from? We already have the data from table 25. Let’s make a chart out of that, considering the period 2011-2012. That would be columns A and E of our data. Here’s a donut chart with the information:

Donut chart

Afghanistan (deep blue) and Iran (orange) alone represent more than 64% of all IMA refugees in Australia in 2011-2012.

From here, there are a lot of routes we could take. We could use the report to take a look at the age of the refugees, like the folks at The Guardian did. We could compare IMA and non-IMA countries and see if there’s a stark difference and, if so, ask why that’s the case. We could look at why Afghans and Iranians are travelling by boat and not plane, and what risks they face as a result. How does the data in this report compare with the data from the more recent report? The analysis could be used to come up with a series of questions to ask the Australian government or a specialist on immigration.

Whatever the case might be, it’s worth remembering that finding stories in data should never be an end in itself. We’re talking about data that’s built on the behavior of people, on the real world. The data is always connected to something out there; you just need to listen to what your spreadsheet is saying. What do you say? Got data?


The World Tweets Nelson Mandela’s Death

- December 10, 2013 in Data Stories, Mapping, Storytelling, Visualisation

The World Tweets Nelson Mandela’s Death – click here to see the interactive version of the map above

Data visualization is awesome! However, it best achieves its goal when it tells a story. This weekend, Mandela’s death dominated the Twitter world and hashtags mentioning Mandela were trending worldwide. I decided to design a map that would show how people around the world tweeted the death of Nelson Mandela. First, I started collecting tweets associated with #RIPNelsonMandela using ScraperWiki. I collected approximately 250,000 tweets on the day of Mandela’s death. You can check this great recipe on the School of Data blog on how to extract and refine tweets.

scraperwiki

After the step above, I refined the collected tweets and uploaded the data into CartoDB. It is one of my favorite open source mapping tools and I will make sure to write a CartoDB tutorial in future posts. I used the Bubble (proportional symbol) map, which is usually better for displaying raw data. Different areas had different tweeting rates and this reflected how different countries reacted. Countries like South Africa, the UK, Spain, and Indonesia had higher tweeting rates. The diameter of the circles represents the number of retweets. With respect to colors, the darker a circle appears, the higher the intensity of tweets.

That’s not the whole story! It is easy to notice that some areas, such as Indonesia and Spain, have high tweeting rates. After researching the topic, it was quite interesting to learn that Mandela had a unique connection with Spain, one forged during two major sporting events: in 2010, Nelson Mandela was present in the stadium when Spain’s international football team won their first ever World Cup trophy. Moreover, for Indonesians, Mandela has always been a source of joy and pride, especially as he was fond of batik and often wore it, even in his international appearances.

Nonetheless, it was evident that interesting insights can be explored, and such data visualizations can help us show the big picture. They also highlight events and facts that we are not aware of in the traditional context.


Visiting Electionland

- November 6, 2013 in Data Stories, HowTo, R, Visualisation


After the German elections, data visualization genius Moritz Stefaner created a map of election districts, grouping them not by geography but by election patterns. This visualisation impressively showed a still-existing divide in Germany. It is a fascinating alternative way to look at elections. On his blog, he explains how he did this visualization. I decided to reconstruct it using Austrian election data (and possibly more countries coming).

Austria recently published the last election’s data as open data, so I took the published dataset and cleaned it up by removing summaries and introducing names for the different states (yes, Austria is a federal state). Then I looked at how to get the results mapped out nicely.

In his blog post, Moritz explains that he used Z-Scores to normalize data and then used a technique called Multidimensional Scaling (MDS) to map the distances calculated between points into 2-dimensional space. So I checked out Multidimensional Scaling, starting on Wikipedia, where I discovered that it’s linear algebra way over my head (yes, I have to finish Strang’s course on linear Algebra at some point). The Wikipedia article fortunately mentions a R command cmdscale that does multidimensional scaling for you. Lucky me! So I wrote a quick R script:

First I needed to normalize the data. Normalization becomes necessary when the raw data itself is very hard to compare. In election data, some voting stations will have a hundred voters, some a thousand; if you just take the raw vote-count, this doesn’t work well to compare, as the numbers are all over the place, so usually it’s broken down into percentages. But even then, if you want to value all parties equally (and have smaller parties influence the graph as much as larger parties), you’ll need to apply a formula to make the numbers comparable.

I decided to use Z-Scores as used by Moritz. The Z-Score is a very simple normalization score that takes two things, the mean and the standard deviation, and tells you how many standard deviations a measurement is above the average measurement. This is fantastic to use in high-throughput testing (the biomed nerd in me shines through here) or to figure out which districts voted more than usual for a specific party.

After normalization, you can perform the magic. I used dist to calculate the distances between districts (by default, this uses Euclidean distance) and then used cmdscale to do the scaling. Works perfectly!
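
The post itself uses R (dist plus cmdscale); as a rough Python equivalent of the same pipeline – z-score normalization, a Euclidean distance matrix, then classical multidimensional scaling – the sketch below uses a small made-up vote-share matrix in place of the real election data:

import numpy as np

# Made-up example: rows are districts, columns are party vote shares (%).
votes = np.array([
    [45.0, 30.0, 15.0, 10.0],
    [40.0, 35.0, 15.0, 10.0],
    [20.0, 25.0, 40.0, 15.0],
])

# Z-score each party's column: how many standard deviations a district
# sits above or below the average result for that party.
z = (votes - votes.mean(axis=0)) / votes.std(axis=0)

# Euclidean distance matrix between districts (what R's dist() computes).
diff = z[:, None, :] - z[None, :, :]
d = np.sqrt((diff ** 2).sum(axis=-1))

# Classical MDS (what R's cmdscale() does): double-centre the squared
# distances and take the top two eigenvectors as 2D coordinates.
n = d.shape[0]
j = np.eye(n) - np.ones((n, n)) / n
b = -0.5 * j @ (d ** 2) @ j
eigvals, eigvecs = np.linalg.eigh(b)
top = np.argsort(eigvals)[::-1][:2]
coords = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))

print(coords)  # x/y positions to feed into a scatter plot (e.g. with D3)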

With newly created X and Y coordinates, the only thing left is visualization—a feat I accomplished using D3 (look at the code—danger, there be dragons). I chose a simpler way of visualizing the data: bubbles sized by the number of voters in the district and colored by the strongest party.

Wahlland visualization of Austrian general Elections 2013
(Interactive version)

You can see: Austria is less divided than Germany. However, if you know the country, you’ll find curious things: Vienna and the very west of Austria, though geographically separated, vote very similarly. So while I moved across the country to study when I was 18, I didn’t move all that much politically. Maybe this is why Vienna felt so comfortable back then—but this is another story to be explored another time.


Findings of the investigation of garment factories of Bangladesh

- October 29, 2013 in Community, Data Expeditions, Data Stories

Credit: Weronika (Flickr) – Some rights reserved.

Connecting the Dots: Mapping the Bangladesh Garment Industry

This post was written in collaboration with Matt Fullerton.

During the weekend of October 18th-October 20th, a group of volunteers, data-wranglers, geo-coders, and activists teamed up with the International Labor Rights Forum and P2PU for a Data Expedition to investigate the Garment Factories. We set out to connect the dots between Bangladeshi garment producers and the clothes that you purchase from the shelves of the world’s largest retailers.

Open Knowledge Foundation Egypt and Open Knowledge Foundation Brasil ran onsite Data Expeditions on garment factories and coordinated with the global investigation.

In previous endeavors, School of Data had examined the deadly history of incidents in garment factories in Bangladesh and the location of popular retailers’ clothing production facilities. This time around, we worked to draw the connections between the retailers that sell our clothes, the factories that make them, the safety agreements they’ve signed, the safety of those buildings, and the workers who occupy them day and night.

Sources of Bangladeshi Garment Data

The Importance of the Garment Industry In Bangladesh

Bangladesh, as many people are aware, is a major provider of garment manufacturing services, and the industry is vital to Bangladesh’s economy, accounting for over 75% of the country’s exports and 17% of the country’s GDP. As in many developing countries, conditions can be harsh, with long hours and unsafe working conditions. This project seeks to provide a resource which can then be used to drive accountability for these conditions and improve the lives and livelihood of the average garment worker.

What’s Being Done

Many organisations and agreements already seek to promote the garment industry in Bangladesh and to ensure worker health and safety (Bangladesh Garment Manufacturers and Exporters Association (BGMEA), Bangladesh Safety Accord, Alliance for Bangladesh Worker Safety, International Labor Rights Forum (ILRF), Clean Clothes Campaign (CCC), Fair Wear Foundation, The Solidarity Center). Collectively, these groups provide a range of data on Bangladeshi garment factories: where they are located, safety incidents, and which retailers the factories supply. Our goal focused on connecting suppliers to sellers within the datasets, and geographically plotting the results on an interactive map. Ultimately, we seek to create a usable tool that is filterable on several criteria, specifically on membership of the various organisations and safety agreements which exist, the factory incident history, and the retailers that are being supplied by these factories. Styling of point radii would allow a quick overview of, for example, the number of workers, and pop-up information could include additional detail from the certification and auditing data, including addresses, contact information, website addresses, incidents, and more.

We made significant progress at the Data Expedition of October 20-21 as we:

Keep Moving Forward

We however do not want to stop here. Rather, we see this as simply the beginning of a longer international collaborative project to make it possible for you to find out who created your clothing and under what conditions.

Get involved in the continued investigation of the garment factories by:


Visualizing the US-Government Shutdown

- October 1, 2013 in Data Stories, HowTo


As of today the US Government is in shutdown. That means that a lot of employees are sent home and services don’t work. The Washington Post Wonkblog has a good story on What you need to know about the shutdown. In the story they list government departments and the percentage of employees to be sent home. I thought: this could be done better – visually!

Gathering the Data

The first thing I did was gather the data (I used only the blog post mentioned above as a source). I started a spreadsheet containing all the data on departments. I decided to do this manually, since the data is pretty unstructured, and to keep the descriptions, since I want to show them on the final visual.

Visualizing

Next up was visualization – I thought about how we could show this and quickly sketched a mockup.

Mockup

Then I started to work. I love D3 for visualizations and Tarek had just written a fabulous tutorial on how to draw arcs in d3. So I set out…

I downloaded the data as CSV and used d3.csv to load the data…. Next I defined the scale for the angles – for this I had to know the total. I used underscore to sum it up and create the scale based on this.

// Sum the Employees column with underscore.js: map each row to its
// employee count, then reduce the counts to a grand total.
var totale = _.reduce(_.map(raw, function (x) {
  return parseInt(x.Employees);
}), function (x, y) { return x + y; });

// Scale mapping a cumulative employee count to an angle around the circle.
var rad = d3.scale.linear()
  .domain([0, totale])
  .range([0, 2 * Math.PI]);

Perfect – next, I needed to convert the data to define my arc formula and do start and stop ranges…

var arc = d3.svg.arc()
  .innerRadius(ri)
  .outerRadius(ro)
  .startAngle(function(d) { return rad(d.start); })
  .endAngle(function(d) { return rad(d.end); });

data = [];
sa = 0;
_.each(raw, function(d) {
  data.push({"department": d.Department,
             "description": d.Description,
             "start": sa,
             "end": sa + parseInt(d.Employees),
             "home": parseInt(d.Home)});
  sa = sa + parseInt(d.Employees);
});

Great – this allowed me to define a graph and draw the first set of arcs…

svg=d3.select("#graph")
.append("svg")
.attr("width",width)
.attr("height",height);

g=svg.append("g")
.attr("transform","translate("+[width/2,height/2]+")")

depts=g.selectAll("g.depts")
.data(data)
.enter()
.append("g")
.attr("class","depts")

depts.append("path")
.attr("class","total")
.attr("d",arc)
.attr("style", function(d) { return "fill: "+
colors(d.department)})

You’ll notice how I created a group for the whole graph (to translate it to the center) and a group for each department. I want to use the department groups to have both the total employees and the employees still working…

Next, I wanted to draw another arc on top of the arcs for the employees still working. This looked easy at first – but we want our visualization to represent the percentages in terms of area too, right? So we can’t just say: if 50% are still working, draw a line halfway between the inner and outer radius. What we need is to calculate the second radius, using a quite complicated formula (it took me a while to derive it, and I thought I should be able to do that much maths…).

The formula is implemented here:

var rd = function(d) {
  // Angle spanned by this department's arc.
  var rho = rad(d.end - d.start);
  // Area of the inner disc segment: 0.5 * ri^2 * rho.
  var i = 0.5 * Math.pow(ri, 2) * rho;
  // Fraction of employees still working.
  var p = (100 - d.home) / 100.0;
  // Solve 0.5 * (x^2 - ri^2) * rho = p * 0.5 * (ro^2 - ri^2) * rho for x,
  // so the inner arc's area is p times the full ring segment's area.
  var x2 = Math.pow(ro, 2) * p - (i * p - i) / (0.5 * rho);
  return Math.sqrt(x2);
};

I’ll need another arc function and then I can draw the arcs:

var arcso = d3.svg.arc()
  .innerRadius(ri)
  .outerRadius(function(d) { return rd(d); })
  .startAngle(function(d) { return rad(d.start); })
  .endAngle(function(d) { return rad(d.end); });

depts.append("path")
  .attr("class", "still")
  .attr("d", arcso)
  .attr("style", function(d) { return "fill: " + colors(d.department); });

Perfect – some styling later this already looks good. The last thing I needed to add was the hovers (done here) and we’re done:

See the Full code on github!


Exploratory Data Analysis – A Short Example Using World Bank Indicator Data

- July 7, 2013 in Data Stories, HowTo

Knowing how to get started with an exploratory data analysis can often be one of the biggest stumbling blocks if a data set is new to you, or you are new to working with data. I recently came across a powerful example from Al Essa/@malpaso where he illustrates one way into exploring a new data set – explaining a set of apparent outliers in the data. (Outliers are points that are atypical compared to the rest of the data, in this example by virtue of taking on extreme values compared to other data points collected at the same time.)

The case refers to an investigation of life expectancy data obtained from the World Bank (World Bank data sets: life expectancy at birth*), and how Al tried to find what might have caused an apparent crash in life expectancy in Rwanda during the 1990s: The Rwandan Tragedy: Data Analysis with 7 Lines of Simple Python Code

*if you want to download the data yourself, you will need to go into the Databank page for the indicator, then make an Advanced Selection on the Time dimension to select additional years of data.

world bank data

The environment that Al uses to analyse the data in the case study is iPython Notebook, an interactive environment for editing Python code within the browser. (You can download the necessary iPython application from here (I installed the Anaconda package to try it) and then follow the iPython Notebook instructions here to get it running. It’s all a bit fiddly, and could do with a simpler install and start routine, but if you follow the instructions it should work okay…)

Ipython notebook

iPython is not the only environment that supports this sort of exploratory data analysis, of course. For example, we can do a similar analysis using the statistical programming language R, and the ggplot2 graphics library to help with the chart plotting. To get the data, I used a special R library called WDI that provides a convenient way of interrogating the World Bank Indicators API from within R, and makes it easy to download data from the API directly.
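
If you want to pull the same indicator without R or the WDI package, the World Bank Indicators API can also be queried directly over HTTP. Here is a rough Python sketch; the endpoint format and the indicator code SP.DYN.LE00.IN (life expectancy at birth) are assumptions to double-check against the current API documentation:

import json
import urllib.request

# Life expectancy at birth for Rwanda, 1980-2013 (indicator code assumed).
url = ("http://api.worldbank.org/v2/country/RWA/indicator/SP.DYN.LE00.IN"
       "?format=json&date=1980:2013&per_page=100")

with urllib.request.urlopen(url) as resp:
    meta, rows = json.load(resp)

# Each row holds a year and a value; print them oldest first to eyeball
# the drop in life expectancy in the early 1990s that the case study explores.
for row in sorted(rows, key=lambda r: r["date"]):
    print(row["date"], row["value"])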

I have posted an example of the case study using R, and the WDI library, here: Rwandan Tragedy (R version). The report was generated from a single file written using a markup language called R markdown in the RStudio environment. R markdown provides a really powerful workflow for creating “reproducible reports” that combine analysis scripts with interpretive text (RStudio – Using Markdown). You can find the actual R markdown script used to generate the Rwanda Tragedy report here.

As you have seen, exploratory data analysis can be thought of as having a conversation with data, asking it questions based on the answers it has previously given you, or on hypotheses you have formed using other sources of information or knowledge. If exploratory data analysis is new to you, try walking through the investigation using either iPython or R, and then see if you can take it further… If you do, be sure to let us know how you got on via the comments:-)


Using SQL for Lightweight Data Analysis

- March 26, 2013 in Data Blog, Data Cleaning, Data Stories, HowTo, SQL

This article introduces the use of SQL for lightweight data analysis by walking through a small data investigation to answer the question: who were the top recipients of Greater London Authority spending in January 2013?

Along the way, it not only introduces SQL (and SQLite) but illustrates various other skills such as locating and cleaning data and how to load tabular data into a relational database.

Note: if you are intrigued by the question or the data wrangling do check out the OpenSpending project – the work described here was part of some recent work by OpenSpending community members at a recent Open Data Maker Night.

Finding the Data

First we need to locate the data online. Let’s start with a web search, e.g.: “London GLA spending” (GLA = Greater London Authority). This quickly yields the jackpot in the form of this web page:

For our work, we’ll focus on the latest month. So jump in and grab the CSV file for February which is at the top of that page (at the moment!).

Preparing the Data

The data looks like this (using the Chrome CSV Viewer extension):

GLA spending CSV in the Chrome CSV Viewer

Unfortunately, it’s clear these files have a fair amount of “human-readable” cruft that make them unsuitable for further processing without some cleaning and preparation. Specifically:

  • There is various “meta” information plus a blank line at the top of each file
  • There are several blank lines at the bottom
  • The leading column is empty

We’ll need to remove these if we want to work with this data properly – e.g. load into OpenSpending, put in a database etc. You could do this by hand in your favourite spreadsheet package but we’ll do this using the classic UNIX command line tools head, tail and sed:

tail -n +7 2012-13-P11-250.csv | head -n -4 | sed "s/^,//g" > 2013-jan.csv

This command takes all lines after the first 6 and before the last 4, strips off the leading “,” and puts it in a new file called 2013-jan.csv. It uses unix pipes to run together these few different operations:

# strip off the first 6 lines
tail -n +7

# strip off the last 4 lines
head -n -4

# remove the lead column in the form of "," at the start of each line
# "^," is a regular expression matching "," at the start of a line ("^"
# matches the start of a line)
sed "s/^,//g"

The result of this is shown in the screenshot below and we’re now ready to move on to the next stage.

GLA spending CSV after cleaning
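
If you don't have the unix tools to hand, the same cleanup can be scripted in a few lines of Python – a rough equivalent of the tail/head/sed pipeline above, not part of the original walkthrough:

import csv

# Drop the first 6 (meta/header cruft) and last 4 (summary) lines, then
# strip the empty leading column from each remaining row.
with open("2012-13-P11-250.csv", newline="") as f:
    rows = list(csv.reader(f))

cleaned = [row[1:] for row in rows[6:-4]]

with open("2013-jan.csv", "w", newline="") as f:
    csv.writer(f).writerows(cleaned)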

Analyzing the Data in a Relational Database (SQLite)

Our aim is to work out the top recipients of money. To do this we need to sum up the amounts spent by Vendor (Name). For the small amount of data here you could use a spreadsheet and pivot tables. However, I’m going to take a somewhat different approach and use a proper (relational) database.

We’ll be using SQLite, an open-source relational database that is lightweight but fully featured. So, first check you have this installed (type sqlite or sqlite3 on the command line – if you don’t have it, it is easy to download and install).

Loading into SQLite

Now we need to load our CSV into SQLite. Here we can take advantage of a short python csv2sqlite script. As its name suggests, this takes a CSV file and loads it into an SQLite DB (with a little bit of extra intelligence to try and guess types). The full listing for this is in the appendix below and you can also download it from a gist here. Once you have it downloaded we can use it:

# this will load our csv file into a new table named "data"
# in a new sqlite database in a file named gla.sqlite
csv2sqlite.py 2013-jan.csv gla.sqlite

Analysis I

Let’s get into the SQLite shell so we can run some SQL:

# note you may need to run sqlite3 rather than sqlite!
sqlite gla.sqlite

Now you will be in the SQLite terminal. Let’s run our query:

sqlite> SELECT "Vendor Name", sum(amount) FROM data
          GROUP BY "Vendor Name"
          ORDER BY SUM(amount) DESC
          LIMIT 20;

How does this work? Well the key thing here is the “GROUP BY” which has a similar function to pivoting in spreadsheets: what it does is group together all the rows with the same value in the “Vendor Name” field. We can then use SELECT to specify fields, or functions of fields that are common or aggregate across all the rows with the same “Vendor Name” value. In this case, we just select the “Vendor Name” and the SUM of the “Amount” field. Lastly, we order the results by the sum (descending – so most first) and limit to only 20 results. The result is as follows:

Vendor Name                          SUM(Amount)
-----------------------------------  -----------
NEWLON HOUSING TRUST                 7540500.0  
ONE HOUSING GROUP                    6655104.0  
L B OF HARINGEY                      6181359.0  
LONDON BOROUGH OF HACKNEY - BSP      5665249.0  
LONDON BOROUGH OF HAVERING           4378650.0  
LONDON BOROUGH OF NEWHAM             3391830.0  
LONDON BOROUGH OF BARKING            2802261.0  
EVERSHEDS                            2313698.54 
METROPOLITAN HOUSING TRUST LIMITED   2296243.0  
BERKELEY PARTNERSHIP HOMES LIMITED   2062500.0  
LONDON BOROUGH OF LAMBETH            1917073.95 
PARADIGM HOUSING GROUP LIMITED       1792068.0  
AMAS LTD                             1673907.5  
VIRIDIAN HOUSING                     1467683.0  
LONDON BOROUGH OF GREENWICH          1350000.0  
CITY OF WESTMINSTER                  1250839.13 
CATALYST HOUSING GROUP LTD            829922.0   
ESTUARY HOUSING ASSOCIATION LIMITED   485157.0   
LOOK AHEAD HOUSING AND CARE           353064.0   
TRANSPORT FOR LONDON                  323954.1   

We could try out some other functions, for example to see the total number of transactions and the average amount we’d do:

sqlite> SELECT "Vendor Name", SUM(Amount), AVG(Amount), COUNT(*)
          FROM data
          GROUP BY "Vendor Name"
          ORDER BY sum(amount) DESC;

Vendor Name                          SUM(Amount)  AVG(Amount)  COUNT(*)  
-----------------------------------  -----------  -----------  ----------
NEWLON HOUSING TRUST                 7540500.0    3770250.0    2         
ONE HOUSING GROUP                    6655104.0    3327552.0    2         
L B OF HARINGEY                      6181359.0    6181359.0    1         
LONDON BOROUGH OF HACKNEY - BSP      5665249.0    1888416.333  3         
LONDON BOROUGH OF HAVERING           4378650.0    4378650.0    1         

This gives us a sense of whether there are many small items or a few big items making up the expenditure.

What we’ve seen so far shows us that (unsurprisingly) GLA’s biggest expenditure is support to other boroughs and to housing associations. One interesting point is the approx £2.3m paid to Eversheds (a City law firm) in January and the £1.7m to Amas Ltd.

Analysis II: Filtering

To get a bit more insight let’s try a crude method to remove boroughs from our list:

sqlite> SELECT "Vendor Name", SUM(Amount) FROM data
          WHERE "Vendor Name" NOT LIKE "%BOROUGH%"
          GROUP BY "Vendor Name"
          ORDER BY sum(amount)
          DESC LIMIT 10;

Here we are using the WHERE clause to filter the results. In this case we are using a “NOT LIKE” clause to exclude all rows where the Vendor Name contains “BOROUGH”. This isn’t quite enough; let’s also try to exclude housing associations / groups:

SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE ("Vendor Name" NOT LIKE "%BOROUGH%" AND "Vendor Name" NOT LIKE "%HOUSING%")
  GROUP BY "Vendor Name"
  ORDER BY sum(amount)
  DESC LIMIT 20;

This yields the following results:

Vendor Name                          SUM(Amount)
-----------------------------------  -----------
L B OF HARINGEY                      6181359.0  
EVERSHEDS                            2313698.54 
BERKELEY PARTNERSHIP HOMES LIMITED   2062500.0  
AMAS LTD                             1673907.5  
CITY OF WESTMINSTER                  1250839.13 
TRANSPORT FOR LONDON                  323954.1   
VOLKER FITZPATRICK LTD                294769.74  
PEABODY TRUST                         281460.0   
GEORGE WIMPEY MAJOR PROJECTS          267588.0   
ST MUNGOS                             244667.0   
ROOFF LIMITED                         243598.0   
R B KINGSTON UPON THAMES              200000.0   
FOOTBALL FOUNDATION                   195507.0   
NORLAND MANAGED SERVICES LIMITED      172420.75  
TURNER & TOWNSEND PROJECT MAGAG       136024.92  
BARRATT DEVELOPMENTS PLC              108800.0   
INNOVISION EVENTS LTD                 108377.94  
OSBORNE ENERGY LTD                    107248.5   
WASTE & RESOURCES ACTION PROGRAMME     88751.45   
CB RICHARD ELLIS LTD                   87711.45 

We still have a few boroughs due to abbreviated or alternative names (Haringey, Kingston, Westminster), but the filter is working quite well. New names are now appearing and we could start to look into these in more detail.

Some Stats

To illustrate a few additional features, let’s get some overall stats.

The number of distinct suppliers: 283

SELECT COUNT(DISTINCT "Vendor Name") FROM data;

Total amount spent in January: approx £60m (60,448,491)

SELECT SUM(Amount) FROM data;

Wrapping Up

We now have an answer to our original question:

  • The biggest recipient of GLA funds in January was Newlon Housing Trust with £7.5m
  • Excluding other governmental or quasi-governmental entities, the biggest recipient was Eversheds, a law firm, with £2.3m

This tutorial has shown we can get these answers quickly and easily using a simple relational database. Of course, there’s much more we could do and we’ll be covering some of these in subsequent tutorials, for example:

  • Multiple tables of data and relations between them (foreign keys and more)
  • Visualization of our results
  • Using tools like OpenSpending to do both of these!

Appendix

Colophon

CSV to SQLite script

Note: this script is intentionally limited by the requirement to have zero dependencies, and its primary purpose is to act as a demonstrator. If you want real CSV-to-SQL power, check out csvsql in the excellent CSVKit, or MessyTables.
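
The original listing isn't reproduced here, but a minimal, zero-dependency sketch of what a csv2sqlite script does might look like the following. The type guessing is deliberately naive and the table name defaults to "data", matching how it is used above; this is an illustrative sketch, not the original script:

#!/usr/bin/env python
"""Load a CSV file into a new SQLite table - a simplified sketch."""
import csv
import sqlite3
import sys

def csv2sqlite(csv_path, db_path, table="data"):
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader)
        # Pad short rows so every row has one value per header column.
        rows = [r + [""] * (len(headers) - len(r)) for r in reader]

    # Naive type guessing: a column is REAL if every non-empty value
    # parses as a number, otherwise TEXT.
    def is_numeric(col):
        values = [r[col] for r in rows if r[col] != ""]
        try:
            [float(v) for v in values]
        except ValueError:
            return False
        return bool(values)

    types = ["REAL" if is_numeric(i) else "TEXT" for i in range(len(headers))]
    cols = ", ".join('"%s" %s' % (h, t) for h, t in zip(headers, types))

    conn = sqlite3.connect(db_path)
    conn.execute('CREATE TABLE "%s" (%s)' % (table, cols))
    placeholders = ", ".join("?" * len(headers))
    conn.executemany('INSERT INTO "%s" VALUES (%s)' % (table, placeholders), rows)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    csv2sqlite(sys.argv[1], sys.argv[2])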

SQL

All the SQL used in this article has been gathered together in one script:

.mode column
.header ON
.width 35
-- first sum
SELECT "Vendor Name", SUM(Amount) FROM data GROUP BY "Vendor Name" ORDER BY sum(amount) DESC LIMIT 20;
-- sum with avg etc
SELECT "Vendor Name", SUM(Amount), AVG(Amount), COUNT(*) FROM data GROUP BY "Vendor Name" ORDER BY sum(amount) DESC LIMIT 5;
-- exclude boroughs
SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE "Vendor Name" NOT LIKE "%Borough%"
  GROUP BY "Vendor Name"
  ORDER BY sum(amount) DESC
  LIMIT 10;
-- exclude boroughs plus housing
SELECT "Vendor Name", SUM(Amount) FROM data
  WHERE ("Vendor Name" NOT LIKE "%BOROUGH%" AND "Vendor Name" NOT LIKE "%HOUSING%")
  GROUP BY "Vendor Name"
  ORDER BY sum(amount) DESC
  LIMIT 20;
-- totals
SELECT COUNT(DISTINCT "Vendor Name") FROM data;
SELECT SUM(Amount) FROM data;

Assuming you had this in a file called ‘gla-analysis.sql’ you could run it against the database by doing:

sqlite gla.sqlite < gla-analysis.sql
