Ex1 - ACLED downloader
Introduction
As a GIS officer in a non-governmental organisation, the field operations unit will ask you to produce daily reports of events in Africa, South Asia and the Middle East. For that, you will use the freely accessible ACLED data.
Armed Conflict Location & Event Data Project (ACLED) is a disaggregated conflict collection, analysis and crisis mapping project. ACLED collects the dates, actors, types of violence, locations, and fatalities of all reported political violence and protest events across Africa, South Asia, South East Asia and the Middle East. Political violence and protest includes events that occur within civil wars and periods of instability, public protest and regime breakdown.
Objectives of the exercise
In this workspace, your task is to download all the events reported for the current year and gather them into an HTML report.
The final result can be seen here:
Description of the workspace
The data will be taken from a web service. Therefore this workspace does not have a Reader. It is triggered by a Creator.
First, a period is defined:
The URL of the ACLED web service is defined here:
If you open the AttributeCreator, you will see that the URL is defined as follows:
https://api.acleddata.com/acled/read?terms=accept&year=@Value(year)&region=1
where @Value(year) is a variable whose value comes from the attribute "year".
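In plain Python, FME's @Value(year) substitution can be sketched like this (the year value is an example; in the workspace it comes from the "year" attribute):

```python
# Sketch of how the @Value(year) placeholder is expanded into the request URL.
year = 2019  # example value; FME takes this from the "year" attribute
url = "https://api.acleddata.com/acled/read?terms=accept&year=%s&region=1" % year
print(url)
```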
Then the JSON is downloaded with an HTTPCaller, which takes the URL defined previously:
The result obtained from the web service is then parsed:
And the geometry is created:
Finally, the report is set up and exported as an HTML file:
The problem
Yes, there is a problem! Have you noticed it?
The workspace runs fine; however, if you look at the number of features included in the report, you will see that you haven't downloaded all the events of the year.
In fact, the URL:
https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019
returns only 500 features.
So how can we download the remaining features?
Feel free to have a look at the API description here.
The API description explains that we can add a "page" parameter in order to specify which page we would like to retrieve.
https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019&page=1
or, to get the next part of the missing features:
https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019&page=2
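Since each page returns at most 500 features, the number of requests needed for a year can be estimated up front. A small sketch (the event total below is illustrative, not a real ACLED figure):

```python
import math

# Each page holds at most 500 features, so the page count is the
# event total divided by 500, rounded up.
events = 1742  # hypothetical number of events for one year
pages = math.ceil(events / 500)
print(pages)
```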
The returned JSON has the following schema:
{
  "success": true,
  "last_update": 91,
  "count": 500,
  "data": [...],
  "filename": "2018-04-06"
}
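A sample response with this schema can be parsed with Python's json module (the values below are illustrative, taken from the schema above with an empty data list):

```python
import json

# Parse a sample response shaped like the schema above.
sample = '{"success": true, "last_update": 91, "count": 500, "data": [], "filename": "2018-04-06"}'
parsed = json.loads(sample)
# "count" tells us how many features this page contains.
print(parsed["count"])
```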
The feature information is included in the "data" element. We also have a "count" element that gives us the number of features in the JSON. Therefore we can use this number to define the following loop:
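The loop logic can be sketched in plain Python with a stand-in for the download step. Here fetch_count is a hypothetical stub simulating the API: two full pages of 500 followed by a short last page.

```python
def fetch_count(page):
    # Stand-in for the real request: pages 1-2 are full, page 3 is the last.
    return 500 if page < 3 else 120

page = 0
total = 0
while True:
    page += 1
    count = fetch_count(page)
    total += count
    if count < 500:  # a short page means we reached the last one
        break
print(page, total)
```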
And guess what, Python will help us to do that!
Exercise
API documentation and the workspace start
Start the workspace by double-clicking "C:\FME_data\acled\acled_download_start.fmw"
Add a PythonCaller
Just after the URL definition (AttributeCreator_2), add a PythonCaller by typing "PythonCaller" on the canvas.
The PythonCaller must be placed between the AttributeCreator_2 and the HTTPCaller, as follows:
Open the PythonCaller
If you double-click on the PythonCaller, the following window will appear:
This window contains a basic template for your python script. For the purpose of this exercise, we will work with the "FeatureProcessor" class.
Some basic explanations:
- The "init" function is automatically run before the first feature enters the PythonCaller
- The "input" function is run one time for each feature
- The "close" function is run when all the features have passed through the PythonCaller
- self.pyoutput(feature) will tell FME to push the object 'feature' to the next transformer (to exit the PythonCaller)
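Outside FME, this calling sequence can be mimicked with a small stand-alone sketch. The pyoutput method and the driver loop below are stand-ins for what the PythonCaller provides, not part of the real FME API:

```python
class FeatureProcessor(object):
    def __init__(self):
        # Runs once, before the first feature arrives.
        self.seen = 0

    def input(self, feature):
        # Runs once per feature.
        self.seen += 1
        self.pyoutput(feature)

    def close(self):
        # Runs once, after the last feature has passed through.
        print("processed %s features" % self.seen)

    def pyoutput(self, feature):
        # Stand-in for FME's output mechanism: collect features in a list.
        out.append(feature)

out = []
fp = FeatureProcessor()
for f in ("a", "b", "c"):  # the driver loop FME would normally run
    fp.input(f)
fp.close()
```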
Modify the script to end up with:
import fme
import fmeobjects

class FeatureProcessor(object):
    def __init__(self):
        pass

    def input(self, feature):
        pass

    def close(self):
        pass
Import modules
For this exercise, we first need to download JSON data from the API and then parse it. Therefore, the urllib module and the json module must be imported.
Add the following lines at the beginning of the script:
import urllib.request
import json
Add variables
We will first create the "page_number" variable and the "url" variable under the input function (remove the pass statement):
def input(self, feature):
    page_number = 0
    url = feature.getAttribute('url')
The "url" variable comes from the value of an attribute created before the PythonCaller. Attribute values can be read into Python as variables by writing feature.getAttribute("attribute_name").
Add a "while True" loop
Let's now define the iterative loop. This while True loop will run until something stops it.
The breaking point will be reached when the number of features included in the response drops below the 500 limit.
To sum this up, for one iteration:
- if the number of features equals 500 (the maximum per page), this is not the final page -> no break
- if the number of features is smaller than 500, we have reached the last page -> break
break simply means we stop the iteration (the loop)
Add the following lines under the input function:
while True:
    page_number += 1
    new_url = "%s&page=%s" % (url, page_number)
For each iteration, one is added to the page number and the new URL is defined.
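The %s placeholders splice the base URL and the page number into the new string. For example:

```python
# %-formatting substitutes each %s in order: first the base URL, then the page.
url = "https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019"
page_number = 1
new_url = "%s&page=%s" % (url, page_number)
print(new_url)
```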
Watch out: if you run the workspace now, the PythonCaller will run infinitely.
Get the number of features per page
It is now time to download the data. Add the following lines inside the while loop:
# get the number of features on this page
response = urllib.request.urlopen(new_url).read()
data = json.loads(response.decode('utf-8'))
number_of_features = data['count']
fmeobjects.FMELogFile().logMessageString("--> The page %s contains %s features" % (page_number, number_of_features))
The variable "data" is now a Python dictionary, and the "number_of_features" variable retrieves the "count" key of that dictionary.
Finally, the logMessageString() function writes information to the log file and to the log window below the FME canvas. You should end up with something like this:
Create the URL
The new URL is created here and output from the PythonCaller.
feature.setAttribute('url', new_url)
self.pyoutput(feature)
Add the loop exit
To stop the loop after the last page, add the following lines:
if number_of_features < 500:
    break
Run the workspace
The final code should look like this:
import fme
import fmeobjects
import urllib.request
import json

class FeatureProcessor(object):
    def __init__(self):
        pass

    def input(self, feature):
        page_number = 0
        url = feature.getAttribute('url')
        while True:
            page_number += 1
            new_url = "%s&page=%s" % (url, page_number)
            # get the number of features on this page
            response = urllib.request.urlopen(new_url).read()
            data = json.loads(response.decode('utf-8'))
            number_of_features = data['count']
            fmeobjects.FMELogFile().logMessageString("--> The page %s contains %s features" % (page_number, number_of_features))
            feature.setAttribute('url', new_url)
            self.pyoutput(feature)
            if number_of_features < 500:
                break

    def close(self):
        pass
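To sanity-check the loop logic outside FME, the download step can be replaced by a stub. Here fake_download is a hypothetical stand-in for urllib.request.urlopen(...).read(), returning three pages with the last one short:

```python
import json

# Stub for the HTTP request: page counts we pretend the API returns.
_PAGES = {1: 500, 2: 500, 3: 57}

def fake_download(url):
    # Read the page number back out of the URL and return a JSON body.
    page = int(url.rsplit("page=", 1)[1])
    return json.dumps({"count": _PAGES[page]}).encode("utf-8")

urls = []
url = "https://api.acleddata.com/acled/read?terms=accept&region=1&year=2019"
page_number = 0
while True:
    page_number += 1
    new_url = "%s&page=%s" % (url, page_number)
    data = json.loads(fake_download(new_url).decode("utf-8"))
    urls.append(new_url)
    if data["count"] < 500:  # short page -> last page reached
        break
print(len(urls))
```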
You are now ready to run the workspace and create the full report!
The next step would be to upload the workspace to FME Server for daily scheduling, but that goes beyond the scope of this exercise.
Advanced task
You may have noticed that we download the same data twice, which is not very efficient in terms of processing time. Can you find a solution to avoid this?
Hint: You can either output the JSON string from the PythonCaller and delete the HTTPCaller, or parse the JSON directly within the PythonCaller. In the second case, you could remove all transformers up to the VertexCreator (whose job can also be done inside the PythonCaller). More on that in the following exercise.