The International Aid Transparency Initiative (IATI) collects data from donor organizations on various projects done within countries. Who donors fund is often an interesting question to ask. I will try to explore the donors who publish their data on IATI in this two-part little project in which we take a close look at funding activities in Kenya.
First we will scrape and collect the data. I will use Python as a programming language – but you could do it in any other programming language. (If you don’t know how to program, don’t worry.) I’ll use Gephi to analyze and visualize the network in the next step. Now let’s see what we can find…
The IATI’s data is collected on their data platform, the IATI registry. This site is mainly links to the data in XML format. The IATI standard is an XML-based standard that took policy and data wonks years to develop – but it’s an accepted standard in the field, and many organizations use it. Python, luckily, has good tools to deal with XML. the IATI registry also has an API, with examples on how to use it. It returns JSON, which is great, since that’s even simpler to deal with than XML. Let’s go!
From the examples, we learn that we can use the following URL to query for activities in a specific country (Kenya in our example):
http://www.iatiregistry.org/api/search/dataset?filetype=activity&country=KE&all_fields=1&limit=200
If you take a close look at the resulting JSON, you will notice it’s an array ([]
) of objects ({}
) and that each object has a download_url
attribute. This is the link to the XML report.
First we’ll need to get all the URLs:
>>> import urllib2, json #import the libraries we need
>>> url="http://www.iatiregistry.org/api/search/" + "datasetfiletype=activity&country=KE&all_fields=1&limit=200"
>>> u=urllib2.urlopen(url)
>>> j=json.load(u)
>>> j['results'][0]
That gives us:
{u'author_email': u'[email protected]',
u'ckan_url': u'http://iatiregistry.org/dataset/afdb-kenya',
u'download_url': u'http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIKenyaData.xml',
u'entity_type': u'package',
u'extras': {u'activity_count': u'5',
u'activity_period-from': u'2010-01-01',
u'activity_period-to': u'2011-12-31',
u'archive_file': u'no',
u'country': u'KE',
u'data_updated': u'2013-06-26 09:44',
u'filetype': u'activity',
u'language': u'en',
u'publisher_country': u'298',
u'publisher_iati_id': u'46002',
u'publisher_organization_type': u'40',
u'publishertype': u'primary_source',
u'secondary_publisher': u'',
u'verified': u'no'},
u'groups': [u'afdb'],
u'id': u'b60a6485-de7e-4b76-88de-4787175373b8',
u'index_id': u'3293116c0629fd243bb2f6d313d4cc7d',
u'indexed_ts': u'2013-07-02T00:49:52.908Z',
u'license': u'OKD Compliant::Other (Attribution)',
u'license_id': u'other-at',
u'metadata_created': u'2013-06-28T13:15:45.913Z',
u'metadata_modified': u'2013-07-02T00:49:52.481Z',
u'name': u'afdb-kenya',
u'notes': u'',
u'res_description': [u''],
u'res_format': [u'IATI-XML'],
u'res_url': [u'http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIKenyaData.xml'],
u'revision_id': u'91f47131-529c-4367-b7cf-21c8b81c2945',
u'site_id': u'iatiregistry.org',
u'state': u'active',
u'title': u'Kenya'}
So j['results']
is our array of result objects, and we want to get the
download_url
properties from its members. I’ll do this with a list comprehension.
>>> urls=[i['download_url'] for i in j['results']]
>>> urls[0:3]
[u'http://www.afdb.org/fileadmin/uploads/afdb/Documents/Generic-Documents/IATIKenyaData.xml', u'http://www.ausaid.gov.au/data/Documents/AusAID_IATI_Activities_KE.xml', u'http://www.cafod.org.uk/extra/data/iati/IATIFile_Kenya.xml']
Fantastic. Now we have a list of URLs of XML reports. Some people might say, “Now
you’ve got two problems” – but I think we’re one step further.
We can now go and explore the reports. Let’s start to develop what we want to do
with the first report. To do this we’ll need lxml.etree
, the XML library. Let’s use it to parse the XML from the first URL we grabbed.
>>> import lxml.etree
>>> u=urllib2.urlopen(urls[0]) #open the first url
>>> r=lxml.etree.fromstring(u.read()) #parse the XML
>>> r
<Element iati-activities at 0x2964730>
Perfect: we have now parsed the data from the first URL.
To understand what we’ve done, why don’t you open the XML report
in your browser and look at it? Notice that every activity has its own
tag iati-activity
and below that a recipient-country
tag which tells us which country gets the money.
We’ll want to make sure that we only
include activities in Kenya – some reports contain activities in multiple
countries – and therefore we have to select for this. We’ll do this using the
XPath query language.
>>> rc=r.xpath('//recipient-country[@code="KE"]')
>>> activities=[i.getparent() for i in rc]
>>> activities[0]
<Element iati-activity at 0x2972500>
Now we have an array of activities – a good place to start. If you look closely
at activities
, there are several participating-org
entries with different
roles. The roles we’re interested in are “Funding” and “Implementing”: who
gives money and who receives money. We don’t care right now about amounts.
>>> funders=activities[0].xpath('./participating-org[@role="Funding"]')
>>> implementers=activities[0].xpath('./participating-org[@role="Implementing"]')
>>> print funders[0].text
>>> print implementers[0].text
Special Relief Funds
WORLD FOOD PROGRAM - WFP - KENYA OFFICE
Ok, now let’s group them together, funders first, implementers later. We’ll do
this with a list comprehension again.
>>> e=[[(j.text,i.text) for i in implementers] for j in funders]
>>> e
[[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')]]
Hmm. There’s one bracket too many. I’ll remove it with a reduce…
>>> e=reduce(lambda x,y: x+y,e,[])
>>> e
[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')]
Now we’re talking. Because we’ll do this for each activity in each report, we’ll
put it into a function.
>>> def extract_funders_implementers(a):
... f=a.xpath('./participating-org[@role="Funding"]')
... i=a.xpath('./participating-org[@role="Implementing"]')
... e=[[(j.text,k.text) for k in i] for j in f]
... return reduce(lambda x,y: x+y,e,[])
...
>>> extract_funders_implementers(activities[0])</code>
[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')]
Now we can do this for all the activities!
>>> fis=[extract_funders_implementers(i) for i in activities]
>>> fis
[[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE')], [('African Development Fund', 'KENYA NATIONAL HIGHWAY AUTHORITY')], [('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT')], [('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT')], [('African Development Fund', 'KENYA ELECTRICITY TRANSMISSION CO. LTD')]]
Yiihaaa! But we have more than one report, so we’ll need to create a
function here as well to do this for each report….
>>> def process_report(report_url):
... try:
... u=urllib2.urlopen(report_url)
... r=lxml.etree.fromstring(u.read())
... except urllib2.HTTPError:
... return [] # return an empty array if something goes wrong
... activities=[i.getparent() for i in </code>
... r.xpath('//recipient-country[@code="KE"]')]</code>
... return reduce(lambda x,y: x+y,[extract_funders_implementers(i) for i in activities],[])
...
>>> process_report(urls[0])
Works great – notice how I removed the additional brackets with using another
reduce? Now guess what? We can do this for all the reports! – ready? Go!
>>> fis=[process_report(i) for i in urls]
>>> fis=reduce(lambda x,y: x+y, fis, [])
>>> fis[0:10]
[('Special Relief Funds', 'WORLD FOOD PROGRAM - WFP - KENYA OFFICE'), ('African Development Fund', 'KENYA NATIONAL HIGHWAY AUTHORITY'), ('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT'), ('African Development Fund', 'MINISTRY OF WATER DEVELOPMENT'), ('African Development Fund', 'KENYA ELECTRICITY TRANSMISSION CO. LTD'), ('CAFOD', 'St. Francis Community Development Programme'), ('CAFOD', 'Catholic Diocese of Marsabit'), ('CAFOD', 'Assumption Sisters of Nairobi'), ('CAFOD', "Amani People's Theatre"), ('CAFOD', 'Resources Oriented Development Initiatives (RODI)')]
Good! Now let’s save this as a CSV:
>>> import csv
>>> f=open("kenya-funders.csv","wb")
>>> w=csv.writer(f)
>>> w.writerow(("Funder","Implementer"))
>>> for i in fis:
... w.writerow([j.encode("utf-8") if j else "None" for j in i])
...
>>> f.close()
Now we can clean this up using Refine and examine it with Gephi in part II.