Ex1 - ACLED downloader
Introduction
As a GIS officer in a non-governmental organisation, the field operations unit will ask you to produce daily reports of events in Africa, South Asia and the Middle East. For that, you will use the freely accessible ACLED data.
Armed Conflict Location & Event Data Project (ACLED) is a disaggregated conflict collection, analysis and crisis mapping project. ACLED collects the dates, actors, types of violence, locations, and fatalities of all reported political violence and protest events across Africa, South Asia, South East Asia and the Middle East. Political violence and protest includes events that occur within civil wars and periods of instability, public protest and regime breakdown.
Objectives of the exercise
In this workspace, your task is to download all the events reported for the current year and gather them into an HTML report.
The final result can be seen here:
Description of the workspace
The data will be taken from a web service. Therefore this workspace does not have a Reader. It is triggered by a Creator.
First, a period is defined:
The URL of the ACLED web service is defined here:
If you open the AttributeCreator, you will see that the URL is defined as follows:
https://api.acleddata.com/acled/read?terms=accept&year=@Value(year)&region=1
where @Value(year) is a variable whose value comes from the attribute "year".
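In plain Python, FME's @Value(year) substitution can be sketched like this (the year value is an example; in the workspace it comes from the "year" attribute):

```python
# Sketch of how the @Value(year) placeholder is expanded into the request URL.
year = 2019  # example value; FME takes this from the "year" attribute
url = "https://api.acleddata.com/acled/read?terms=accept&year=%s&region=1" % year
print(url)
```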
Then the JSON is downloaded with an HTTPCaller, which takes the URL defined previously:
The result obtained from the web service is then parsed:
And the geometry is created:
Finally, the report is set up and exported as an HTML file:
The problem
Yes, there is a problem! Have you noticed it?
The workspace runs fine; however, if you look at the number of features included in the report, you will see that you haven't downloaded all the events of the year.
In fact, the URL:
https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019
returns only 500 features.
So how can we download the remaining features?
Feel free to have a look at the API description here.
The API description explains that we can add a "page" parameter in order to specify which page we would like to retrieve.
https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019&page=1
or, to get the next part of the missing features:
https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019&page=2
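Since each page returns at most 500 features, the number of requests needed for a year can be estimated up front. A small sketch (the event total below is illustrative, not a real ACLED figure):

```python
import math

# Each page holds at most 500 features, so the page count is the
# event total divided by 500, rounded up.
events = 1742  # hypothetical number of events for one year
pages = math.ceil(events / 500)
print(pages)
```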
The returned JSON has the following schema:
{
  "success": true,
  "last_update": 91,
  "count": 500,
  "data": [...],
  "filename": "2018-04-06"
}
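A sample response with this schema can be parsed with Python's json module (the values below are illustrative, taken from the schema above with an empty data list):

```python
import json

# Parse a sample response shaped like the schema above.
sample = '{"success": true, "last_update": 91, "count": 500, "data": [], "filename": "2018-04-06"}'
parsed = json.loads(sample)
# "count" tells us how many features this page contains.
print(parsed["count"])
```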
The feature information is included in the "data" element. We also have a "count" element that gives us the number of features in the JSON. Therefore we can use this number to define the following loop:
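The loop logic can be sketched in plain Python with a stand-in for the download step. Here fetch_count is a hypothetical stub simulating the API: two full pages of 500 followed by a short last page.

```python
def fetch_count(page):
    # Stand-in for the real request: pages 1-2 are full, page 3 is the last.
    return 500 if page < 3 else 120

page = 0
total = 0
while True:
    page += 1
    count = fetch_count(page)
    total += count
    if count < 500:  # a short page means we reached the last one
        break
print(page, total)
```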
And guess what, Python will help us to do that!
Exercise
API documentation and the workspace start
Start the workspace by double-clicking "C:\FME_data\acled\acled_download_start.fmw"
Add a PythonCaller
Just after the URL definition (AttributeCreator_2), add a PythonCaller by typing "PythonCaller" on the canvas.
The PythonCaller must be placed between the AttributeCreator_2 and the HTTPCaller, as follows:
Open the PythonCaller
If you double-click on the PythonCaller, the following window will appear:
This window contains a basic template for your python script. For the purpose of this exercise, we will work with the "FeatureProcessor" class.
Some basic explanations:
- The "init" function is automatically run before the first feature enters the PythonCaller
- The "input" function is run one time for each feature
- The "close" function is run when all the features have passed through the PythonCaller
- self.pyoutput(feature) will tell FME to push the object 'feature' to the next transformer (to exit the PythonCaller)
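Outside FME, this calling sequence can be mimicked with a small stand-alone sketch. The pyoutput method and the driver loop below are stand-ins for what the PythonCaller provides, not part of the real FME API:

```python
class FeatureProcessor(object):
    def __init__(self):
        # Runs once, before the first feature arrives.
        self.seen = 0

    def input(self, feature):
        # Runs once per feature.
        self.seen += 1
        self.pyoutput(feature)

    def close(self):
        # Runs once, after the last feature has passed through.
        print("processed %s features" % self.seen)

    def pyoutput(self, feature):
        # Stand-in for FME's output mechanism: collect features in a list.
        out.append(feature)

out = []
fp = FeatureProcessor()
for f in ("a", "b", "c"):  # the driver loop FME would normally run
    fp.input(f)
fp.close()
```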
Modify the script to end up with:
import fme
import fmeobjects

class FeatureProcessor(object):
    def __init__(self):
        pass

    def input(self, feature):
        pass

    def close(self):
        pass
Import modules
For this exercise, we first need to download JSON data from the API and then parse it. Therefore, the urllib module and the json module must be imported.
Add the following lines at the beginning of the script:
import urllib.request
import json
Add variables
We will first create the "page_number" variable and the "url" variable under the input function (remove the pass statement):
def input(self, feature):
    page_number = 0
    url = feature.getAttribute('url')
The "url" variable comes from the value of an attribute created before the PythonCaller. Attribute values can be read into Python as variables by writing feature.getAttribute("attribute_name").
Add a "while True" loop
Let's now define the iterative loop. This while True loop will run until something stops it.
The breaking point will be reached when the number of features included in the response drops below the 500 limit.
To sum this up, for one iteration:
- if the number of features equals 500 (the maximum per page), this is not the final page -> no break
- if the number of features is smaller than 500, we have reached the last page -> break
break simply means we stop the iteration (the loop)
Add the following lines under the input function:
while True:
    page_number += 1
    new_url = "%s&page=%s" % (url, page_number)
For each iteration, one is added to the page number and the new URL is defined.
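The %s placeholders splice the base URL and the page number into the new string. For example:

```python
# %-formatting substitutes each %s in order: first the base URL, then the page.
url = "https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019"
page_number = 1
new_url = "%s&page=%s" % (url, page_number)
print(new_url)
```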
Watch out: if you run the workspace now, the PythonCaller will run infinitely.
Get the number of features per page
It is now time to download the data. Add the following lines inside the while loop:
# get the number of features on this page
response = urllib.request.urlopen(new_url).read()
data = json.loads(response.decode('utf-8'))
number_of_features = data['count']
fmeobjects.FMELogFile().logMessageString("--> The page %s contains %s features" % (page_number, number_of_features))
The variable "data" is now a Python dictionary, and the "number_of_features" variable retrieves the "count" key of that dictionary.
Finally, the logMessageString() function writes information to the log file and to the log window below the FME canvas. You should end up with something like this:
Create the URL
The new URL is created here and output from the PythonCaller.
feature.setAttribute('url', new_url)
self.pyoutput(feature)
Add the loop exit
To stop the loop after the last page, add the following lines:
if number_of_features < 500:
    break
Run the workspace
The final code should look like this:
import fme
import fmeobjects
import urllib.request
import json

class FeatureProcessor(object):
    def __init__(self):
        pass

    def input(self, feature):
        page_number = 0
        url = feature.getAttribute('url')
        while True:
            page_number += 1
            new_url = "%s&page=%s" % (url, page_number)
            # get the number of features on this page
            response = urllib.request.urlopen(new_url).read()
            data = json.loads(response.decode('utf-8'))
            number_of_features = data['count']
            fmeobjects.FMELogFile().logMessageString("--> The page %s contains %s features" % (page_number, number_of_features))
            feature.setAttribute('url', new_url)
            self.pyoutput(feature)
            if number_of_features < 500:
                break

    def close(self):
        pass
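To sanity-check the loop logic outside FME, the download step can be replaced by a stub. Here fake_download is a hypothetical stand-in for urllib.request.urlopen(...).read(), returning three pages with the last one short:

```python
import json

# Stub for the HTTP request: page counts we pretend the API returns.
_PAGES = {1: 500, 2: 500, 3: 57}

def fake_download(url):
    # Read the page number back out of the URL and return a JSON body.
    page = int(url.rsplit("page=", 1)[1])
    return json.dumps({"count": _PAGES[page]}).encode("utf-8")

urls = []
url = "https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019"
page_number = 0
while True:
    page_number += 1
    new_url = "%s&page=%s" % (url, page_number)
    data = json.loads(fake_download(new_url).decode("utf-8"))
    urls.append(new_url)
    if data["count"] < 500:  # short page -> last page reached
        break
print(len(urls))
```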
You are now ready to run the workspace and create the full report!
The next step would be to upload the workspace to FME Server for daily scheduling, but that goes beyond the scope of this exercise.
Advanced task
You may have noticed that we download the same data twice, which is not very efficient in terms of processing time. Can you find a solution to avoid this?
Hint: You can either output the JSON string from the PythonCaller and delete the HTTPCaller, or parse the JSON directly within the PythonCaller. In the second case, you could remove all transformers up to the VertexCreator (whose job can also be done inside the PythonCaller). More on that in the following exercise.