Distributed spatial analytics with Ray

Ray is an open-source, easy-to-use library for distributed computing in Python. In my opinion, the biggest advantage of Ray is that Python code can be parallelized as it is, so the code does not have to be written in a specific way. Functions can be used as tasks and classes can be used as actors. If your process has multiple functions that produce some data and pass this data to other functions, then it is better to use classes/actors: that reduces data transfer between workers, and ideally the entire process and its data flow should be executed by a single actor. In this post I will compare the running times of the spatial analytics from my previous post about geopandas, run as a single process vs. distributed processing with Ray.
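
As a quick illustration of both patterns, here is a minimal sketch; the names square and Pipeline are hypothetical and unrelated to the spatial process below.

```python
import ray

ray.init()

# A function becomes a task by adding the @ray.remote decorator.
@ray.remote
def square(x):
    return x * x

# A class becomes an actor the same way. Its state lives on a single
# worker, so data produced by one method can be consumed by another
# without being shipped between workers.
@ray.remote
class Pipeline:
    def __init__(self):
        self.data = []

    def produce(self, n):
        self.data = list(range(n))

    def consume(self):
        return sum(self.data)

# .remote() returns futures; ray.get blocks and collects the results.
print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

pipeline = Pipeline.remote()
pipeline.produce.remote(10)
print(ray.get(pipeline.consume.remote()))  # 45

ray.shutdown()
```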

The following class will be used for single processing. You will notice later on that exactly the same class, just with an additional decorator, will be used for parallel processing.
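
The class itself comes from the previous geopandas post and is not reproduced here; a hypothetical sketch, assuming it loads one L2 admin region and counts the points falling inside it (file, column, and method names are illustrative):

```python
import geopandas as gpd

class spatial_actor:
    def __init__(self, admin_name):
        # One instance handles one L2 admin region.
        self.admin_name = admin_name

    def read_admin(self):
        # Load the admin boundaries and keep only this region.
        admin = gpd.read_file("admin_L2.geojson")
        admin = admin[admin["name"] == self.admin_name]
        # Spatially join the point data and count points in the region.
        points = gpd.read_file("points.geojson")
        joined = gpd.sjoin(points, admin, predicate="within")
        return self.admin_name, len(joined)
```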

The single-processing running time is 165.7 sec.

Now it’s time to see Ray in action. As I said before, our code can be used as it is; the only difference is an additional decorator before the class definition: @ray.remote
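
Sketched with the same hypothetical class as above:

```python
import ray

@ray.remote  # the only change versus the single-processing version
class spatial_actor:
    def __init__(self, admin_name):
        self.admin_name = admin_name

    def read_admin(self):
        # body identical to the single-processing class above
        ...
```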

The next function will execute our distributed process; a sketch is shown below, followed by notes on a few methods that are specific to Ray.
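
A sketch of such a driver function, assuming the hypothetical spatial_actor class above and a list of the 16 region names (again an assumption, not the post's actual code):

```python
import ray

def run_distributed(admin_names):
    # Start the Ray runtime; with no arguments, Ray decides which
    # resources (CPUs/GPUs) to use.
    ray.init()

    # One actor per L2 admin region (16 in our case).
    actors = [spatial_actor.remote(name) for name in admin_names]

    # .remote() calls return immediately with futures; ray.get blocks
    # until every actor has finished and returns the actual results.
    results = ray.get([actor.read_admin.remote() for actor in actors])

    # Terminate the Ray runtime.
    ray.shutdown()
    return results
```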
The ray.init() call starts a Ray runtime on your machine. We can specify the resources we want to use, i.e. the number of CPUs or GPUs, but for this example we will let Ray decide which resources to use. At the end of the script you will find ray.shutdown(), which terminates the Ray runtime.
spatial_actor.remote() – calling the spatial_actor class this way creates our actor. We have 16 L2 admin regions in Poland, so we create 16 actors.
To fetch the result of processing from each actor, you need to use ray.get, i.e. ray.get([actor1.read_admin.remote(), actor2.read_admin.remote(), actor3.read_admin.remote(), …]).

With Ray, our process runs more than 3 times faster than the single-processing approach.

As you can see, in our case the same code works 3 times faster with Ray. Note that although we used distributed execution, the entire calculation was done on a single machine. Imagine what we could achieve by scaling it further with a cloud solution such as AWS EMR…