Marco Túlio Pires

Tackling PDFs with Tabula

Marco Túlio Pires - March 27, 2014 in Scraping

School of Data mentor Marco Túlio Pires has been writing for our friends at Tactical Tech about journalistic data investigation. This post introduces a tool for extracting data from PDFs, and it was originally published on Exposing the Invisible‘s resources page.

PDF files are pesky. If you copy and paste a table from a PDF into a new document, the result will be messy and ugly. You either have to type the data by hand into your spreadsheet processor or use an app to do that for you. Here we will walk through a tool that does it automatically: Tabula.

Tabula is awesome because it’s free and works on all major operating systems. All you have to do is download the zip file and extract a folder. You don’t even have to install anything, provided you’ve got Java on your machine, which is very likely. It works in your browser, and it’s all about visual controls. You use your mouse to select data you want to convert, and you download it converted to CSV. Simple as that.

Though Tabula is still experimental software, it is really easy to use. Once you fire it up, it will open your web browser with a welcome screen. Submit your PDF file and Tabula will process your file and show you a nice list of page thumbnails. Look for the table you want to extract.

Finding the table

Click and drag to select the area of the table. Once you release, Tabula will show you the extracted data in a friendly format. If the data is fuzzy, try selecting a narrower area, try removing the headers or the footnote, etc. Play around with it a bit. You can either download it as CSV and open it in a with spreadsheet program or copy the data to the clipboard.

Extracted data

Once the data is in a spreadsheet, you may need to do a bit of editing such as correcting the headers. Tabula won’t be perfect 100% of the time – it’s very good with numbers, but it can get confused by multi-line headers.

For further information and resources on Tabula, see:

Tabula has a video that walks through how to set up the tool.
Read the Open News article introducing Tabula and setting out some of its limitations.
Check out the School of Data Link discussion on Tabula and some alternatives.

No Comments »

Learning to Listen to your Data

Marco Túlio Pires - March 27, 2014 in Data Stories

School of Data mentor Marco Túlio Pires has been writing for our friends at Tactical Tech about journalistic data investigation. This post “talks us through how to begin approaching and thinking about stories in data”, and it was originally published on Exposing the Invisible‘s resources page.

Journalists used to carefully establish relationships with sources in the hope of getting a scoop or obtaining a juicy secret. While we still do that, we now have a new source which we interrogate for information: data. Datasets have become much like those real sources – someone (or something!) that holds the key to many secrets. And as we begin to treat datasets as sources, as if they were someone we’d like to interview, to ask meaningful and difficult questions to, they start to reveal their stories, and more often than not, we come across tales we weren’t even looking for.

But how do we do it? How can we find stories buried underneath a pile of raw data? That’s what this post will try to show you: the process of understanding your data and listening to what your “interviewee” is trying to tell you. And instead of giving you a lecture about the ins and outs of data analysis, we’ll walk you through an example.

Let’s take an example from the The Guardian, the British newspaper that has a very active data-driven operation. We’re going to (try to) “reverse engineer” one of their stories in the hopes you get a glimpse at what happens when you go after information that you have to compile, clean, and analyse and what kind of decisions we make along the way to tell a story out of a dataset.

So, let’s talk about immigration. Every year, the Department of Immigration and Border Protection of Australia publishes a bunch of documents about immigration statistics down under. Published last year, the team at The Guardian focused on a report called Asylum Trends for 2011-2012. There’s a more up-to-date version available (2012-2013). By the end of this exercise, we hope you can use the newer version to compare it with the dataset used by The Guardian. Let us know in the comments about your findings.

The article starts with a broad question: does Australia have a problem with refugees? That’s the underlying question that helps makes this story relevant. It’s useful to start a data-driven investigation with a question, something that bothers you, something that doesn’t seem quite right, something that might be an issue for a lot of people.

With that question in mind, I quickly found a table on page 2 with the total number of people seeking protection in Australia.

People seeking Austria's protection

Let’s make a chart out of this and see what the trend is. Because this is a pesky PDF file, you’ll need to either type the data by hand into your spreadsheet processor or use an app to do that for you. For a walkthrough of a tool that does this automatically, see the Tabula example here.

After putting the PDF into Tabula this is what we get (data was imported into OpenOffice Calc):

Tabula

I opened the CSV file in OpenOffice Calc and edited it a bit to make it clearer. Let’s see how the number of people seeking Australia’s protection has changed over the years. Using the Chart feature in the spreadsheet, we can compare columns A and D by making a line chart.

Line chart

Take a good look at this chart. What’s happening here? On the vertical axis, we see the total number of people asking for Australia’s protection. On the horizontal axis, we see the timeline year by year. Between 2003 and 2008, there’s no significant change. But something happened from 2009 on. By the end of the series, it’s almost three times higher. Why? We don’t know yet. Let’s take a look at other data from the PDF and use Tabula to import it to our spreadsheet. Maybe that will show us what’s going on.

Australia divides their refugees into two groups: those who arrived by boat and those who arrived by air. They use the acronyms IMA and non-IMA (IMA stands for Irregular Maritime Arrivals). Let’s compare the totals of the two groups and see how they relate across the years presented in this report. Using Table 4 and Table 25, we’ll create a new table that has the totals for the two groups. Be careful, though, the non-IMA table goes back up to 2007, but the IMA table goes only as far as 2008. Let’s create a line chart with this data.

What’s that? It seems that in 2011-2012, for the first time in this time series, the number of refugees arriving in Australia by boat surpassed those landing by plane. The next question could be: where are all the IMA refugees coming from? We already have the data from table 25. Let’s make a chart out of that, considering the period 2011-2012. That would be columns A and E of our data. Here’s a donut chart with the information:

Donut chart

Afghanistan (deep blue) and Iran (orange) alone represent more than 64% of all IMA refugees in Australia in 2011-2012.

From here, there are a lot of routes we could take. We could use the report to take a look at the age of the refugees, like the folks at The Guardian did. We could compare IMA and non-IMA countries and see if there’s a stark difference and, if so, ask why that’s the case. We could look at why Afghans and Iranians are travelling by boat and not plane, and what risks they face as a result. How does the data in this report compare with the data from the more recent report? The analysis could be used to come up with a series of questions to ask to the Australian government or a specialist on immigration.

Whatever the case might be, it’s worth remembering that finding stories in data should never be an activity that ends in itself. We’re talking about data that’s built on the behavior of people, on the real world. The data is always connected to something out there, you just need to listen to what your spreadsheet is saying. What do you say? Got data?

No Comments »

You are browsing the archive for Marco Túlio Pires.

Tackling PDFs with Tabula

Learning to Listen to your Data

Со поддршка од

Соработка меѓу

Пребарај