Scraping PDFs with Python and the scraperwiki module
Michael Bauer - August 16, 2013 in Scraping, Tech
While for simple single or double-page tables tabula is a viable option – if you have PDFs with tables over multiple pages you’ll soon grow old marking them.
This is where you’ll need some scripting. Thanks to scraperwikis library (pip install scraperwiki
) and the included function pdftoxml
– scraping PDFs has become a feasible task in python. On a recent Hacks/Hackers event we run into a candidate – that was quite tricky to scrape – I decided to protocol the process here.
import scraperwiki, urllib2
First import the scraperwiki library and urllib2 – since the file we’re using is on a webserver – then open and parse the document…
u=urllib2.urlopen("http://images.derstandard.at/2013/08/12/VN2p_2012.pdf") #open the url for the PDF x=scraperwiki.pdftoxml(u.read()) # interpret it as xml print x[:1024] # let's see what's in there abbreviated...
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE pdf2xml SYSTEM "pdf2xml.dtd">
<pdf2xml producer="poppler" version="0.22.5">
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<fontspec id="0" size="8" family="Times" color="#000000"/>
<fontspec id="1" size="7" family="Times" color="#000000"/>
<text top="42" left="64" width="787" height="12" font="0"><b>TABELLE VN2Ap/1
30/07/13 11.38.44 BLATT 1 </b></text>
<text top="58" left="64" width="718" height="12" font="0"><b>STATISTIK ALLER VORNAMEN (TEILWEISE PHONETISCH ZUSAMMENGEFASST, ALPHABETISCH SORTIERT) FÜR NEUGEBORENE KNABEN MIT </b></text>
<text top="73" left="64" width="340" height="12" font="0"><b>ÖSTERREICHISCHER STAATSBÜRGERSCHAFT 2012 - ÖSTERREICH </b></text>
<text top="89" left="64" width="6" height="12" font="0"><b> </b></text>
<text top="104" left="64" width="769" height="12" font="0"><b>VORNAMEN ABSOLUT
%
As you can see above, we have successfully loaded the PDF as xml (take a look at
the PDF by just opening the url given, it should give you an idea how it is
structured).
The basic structure of a pdf parsed this way will always be page
tags
followed by text
tags contianing the information, positioning and font
information. The positioning and font information can often help to get the
table we want – however not in this case: everything is font=”0″ and left=”64″.
We can now use xpath to query our
document…
import lxml r=lxml.etree.fromstring(x) r.xpath('//page[@number="1"]')
[<Element page at 0x31c32d0>]
and also get some lines out of it
r.xpath('//text[@left="64"]/b')[0:10] #array abbreviated for legibility
[<Element b at 0x31c3320>,
<Element b at 0x31c3550>,
<Element b at 0x31c35a0>,
<Element b at 0x31c35f0>,
<Element b at 0x31c3640>,
<Element b at 0x31c3690>,
<Element b at 0x31c36e0>,
<Element b at 0x31c3730>,
<Element b at 0x31c3780>,
<Element b at 0x31c37d0>]
r.xpath('//text[@left="64"]/b')[8].text
u'Aaron * 64 0,19 91 Aim\xe9
1 0,00 959 '
Great – this will help us. If we look at the document you’ll notice that there
are all boys names from page 1-20 and girls names from page 21-43 – let’s get
them seperately…
boys=r.xpath('//page[@number<="20"]/text[@left="64"]/b') girls=r.xpath('//page[@number>"20" and @number<="43"]/text[@left="64"]/b') print boys[8].text print girls[8].text
Aaron * 64 0,19 91 Aimé 1
0,00 959
Aarina 1 0,00 1.156 Alaïa 1
0,00 1.156
fantastic – but you’ll also notice something – the columns are all there,
sperated by whitespaces. And also Aaron has an asterisk – we want to remove it
(the asterisk is explained in the original doc).
To split it up into columns I’ll create a small function using regexes to split
it.
import re
def split_entry(e):
return re.split("[ ]+",e.text.replace("*","")) # we're removing the asterisk here as well...
now let’s apply it to boys and girls
boys=[split_entry(i) for i in boys] girls=[split_entry(i) for i in girls] print boys[8] print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959', u'']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156', u'']
That worked!. Notice the empty string u” at the end? I’d like to filter it.
I’ll do this using the ifilter function from itertools
import itertools boys=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in boys] girls=[[i for i in itertools.ifilter(lambda x: x!="",j)] for j in girls] print boys[8] print girls[8]
[u'Aaron', u'64', u'0,19', u'91', u'Aim\xe9', u'1', u'0,00', u'959']
[u'Aarina', u'1', u'0,00', u'1.156', u'Ala\xefa', u'1', u'0,00', u'1.156']
Worked, this cleaned up our boys and girls arrays. We want to make them properly
though – there are two columns each four fields wide. I’ll do this with a little
function
def take4(x):
if (len(x)>5):
return [x[0:4],x[4:]]
else:
return [x[0:4]]
boys=[take4(i) for i in boys]
girls=[take4(i) for i in girls]
print boys[8]
print girls[8]
[[u'Aaron', u'64', u'0,19', u'91'], [u'Aim\xe9', u'1', u'0,00', u'959']]
[[u'Aarina', u'1', u'0,00', u'1.156'], [u'Ala\xefa', u'1', u'0,00', u'1.156']]
ah that worked nicely! – now let’s make sure it’s one array with both options in
it -for this i’ll use reduce
boys=reduce(lambda x,y: x+y, boys, []) girls=reduce(lambda x,y: x+y, girls,[]) print boys[10] print girls[10]
['Aiden', '2', '0,01', '667']
['Alaa', '1', '0,00', '1.156']
perfect – now let’s add a gender to the entries
for x in boys:
x.append("m")
for x in girls:
x.append("f")
print boys[10]
print girls[10]
['Aiden', '2', '0,01', '667', 'm']
['Alaa', '1', '0,00', '1.156', 'f']
We got that! For further processing I’ll join the arrays up
names=boys+girls print names[10]
['Aiden', '2', '0,01', '667', 'm']
let’s take a look at the full array…
names[0:10]
[['TABELLE', 'VN2Ap/1', '30/07/13', '11.38.44', 'm'],
['BLATT', '1', 'm'],
[u'STATISTIK', u'ALLER', u'VORNAMEN', u'(TEILWEISE', 'm'],
[u'PHONETISCH',
u'ZUSAMMENGEFASST,',
u'ALPHABETISCH',
u'SORTIERT)',
u'F\xdcR',
u'NEUGEBORENE',
u'KNABEN',
u'MIT',
'm'],
[u'\xd6STERREICHISCHER', u'STAATSB\xdcRGERSCHAFT', u'2012', u'-', 'm'],
['m'],
['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
['VORNAMEN', 'ABSOLUT', '%', 'RANG', 'm'],
['m'],
['INSGESAMT', '34.017', '100,00', '.', 'm']]
Notice there is still quite a bit of mess in there: basically all the lines
starting with an all caps entry, “der”, “m” or “f”. Let’s remove them….
names=itertools.ifilter(lambda x: not x[0].isupper(),names) # remove allcaps entries names=[i for i in itertools.ifilter(lambda x: not (x[0] in ["der","m","f"]),names)] # remove all entries that are "der","m" or "f" names[0:10]
[['Aiden', '2', '0,01', '667', 'm'],
['Aiman', '3', '0,01', '532', 'm'],
[u'Aaron', u'64', u'0,19', u'91', 'm'],
[u'Aim\xe9', u'1', u'0,00', u'959', 'm'],
['Abbas', '2', '0,01', '667', 'm'],
['Ajan', '2', '0,01', '667', 'm'],
['Abdallrhman', '1', '0,00', '959', 'm'],
['Ajdin', '15', '0,04', '225', 'm'],
['Abdel', '1', '0,00', '959', 'm'],
['Ajnur', '1', '0,00', '959', 'm']]
Woohoo – we have a cleaned up list. Now let’s write it as csv….
import csv
f=open("names.csv","wb") #open file for writing
w=csv.writer(f) #open a csv writer
w.writerow(["Name","Count","Percent","Rank","Gender"]) #write the header
for n in names:
w.writerow([i.encode("utf-8") for i in n]) #write each row
f.close()
Done, We’ve scraped a multi-page PDF using python. All in all this was a fairly
quick way to get the data out of a PDF using the scraperwiki module.