Protocol Buffers

“…Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages…” That’s the definition of Protocol Buffers on the official web page. Alongside Apache Avro and Apache Thrift, it is one of the most popular data serialization frameworks used for big data containers. In this post, I will use my preferred programming language (yes, Python) to write (serialize) and read (deserialize) protobuf data.

We need to start by defining the protocol buffer message schema. Here you define the field types, whether each field is optional or required, and whether a field can be repeated. It’s worth spending some time on this definition, because this structure will dictate how you write, read, and access the data in your code.

Below you can find my message schema. It was defined to store city information downloaded from the simplemaps web page.

cities.proto:

syntax = "proto2";

package gisdatascience;

message City_prop {
  required string name = 1;
  required float lat = 2;
  required float lon = 3;
  required string country = 4;
  required int32 population = 5;

  enum ISOcodeType {
    ISO2 = 0;
    ISO3 = 1;
  }

  message ISOcode {
    required string code = 1;
    optional ISOcodeType type = 2 [default = ISO2];
  }

  repeated ISOcode isocods = 6;
}

message Cities {
  repeated City_prop City = 1;
}

In order to work with this schema, you need to generate the classes that let you read and write Cities messages. To do this, you run the protocol buffer compiler, protoc, on your .proto file, but first you need to install it:

  1. Download a precompiled binary of the protocol buffer compiler (protoc) from https://github.com/google/protobuf/releases (for example “protoc-3.4.0-win32.zip”) and add its location to your PATH environment variable.

  2. Install the protobuf module (conda install protobuf).

  3. Run the protoc compiler on the .proto file (protoc -I=source_dir --python_out=dest_dir source_dir/cities.proto).

As a result of this compilation, you will get a cities_pb2.py file with the descriptors that let you work with the Cities class (you should import this file into your program).
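
If you prefer to drive the compilation step from Python, the command above can be wrapped in a small helper. This is only a sketch: the function names build_protoc_command and compile_proto are my own, and it assumes protoc is already on your PATH. The -I and --python_out flags are the same ones used in step 3.

```python
import shutil
import subprocess


def build_protoc_command(source_dir, dest_dir, proto_file="cities.proto"):
    """Return the protoc argument list that generates Python classes."""
    return [
        "protoc",
        f"-I={source_dir}",
        f"--python_out={dest_dir}",
        f"{source_dir}/{proto_file}",
    ]


def compile_proto(source_dir, dest_dir, proto_file="cities.proto"):
    """Run protoc if it is available; return False when it is not on PATH."""
    if shutil.which("protoc") is None:
        print("protoc not found on PATH; install it first")
        return False
    # check=True raises CalledProcessError if protoc reports an error
    subprocess.run(build_protoc_command(source_dir, dest_dir, proto_file), check=True)
    return True


# Show the command that would be executed (nothing is run here)
print(" ".join(build_protoc_command("source_dir", "dest_dir")))
```

Calling compile_proto("source_dir", "dest_dir") should then leave cities_pb2.py in dest_dir, exactly as the manual command does.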

Alright, now we are ready to write Python code. Let’s start by loading our data into a pandas dataframe and then adding it row by row to a Cities object.

import cities_pb2
import pandas as pd
csv_file = pd.read_csv('simplemaps-worldcities-basic.csv', sep=',', header=0)
csv_file.head()
city city_ascii lat lng pop country iso2 iso3 province
0 Qal eh-ye Now Qal eh-ye 34.983000 63.133300 2997.0 Afghanistan AF AFG Badghis
1 Chaghcharan Chaghcharan 34.516701 65.250001 15000.0 Afghanistan AF AFG Ghor
2 Lashkar Gah Lashkar Gah 31.582998 64.360000 201546.0 Afghanistan AF AFG Hilmand
3 Zaranj Zaranj 31.112001 61.886998 49851.0 Afghanistan AF AFG Nimroz
4 Tarin Kowt Tarin Kowt 32.633298 65.866699 10000.0 Afghanistan AF AFG Uruzgan
# Open (or create) the file that will hold the serialized list of cities
serialized_file = open('Cities_proto', "wb")
# Create an empty Cities object
Cities = cities_pb2.Cities()
# Iterate through the csv_file dataframe and add each row to the Cities object
for i, row in csv_file.iterrows():
    new_city = Cities.City.add()
    new_city.name = row['city_ascii']
    new_city.lat = row['lat']
    new_city.lon = row['lng']
    new_city.country = row['country']
    new_city.population = int(row['pop'])
    # Each city gets two ISO codes: the 2-letter and the 3-letter one
    iso = new_city.isocods.add()
    iso.code = str(row['iso2'])
    iso.type = cities_pb2.City_prop.ISO2
    iso = new_city.isocods.add()
    iso.code = str(row['iso3'])
    iso.type = cities_pb2.City_prop.ISO3
# Display the first 2 cities from the Cities object
Cities.City[0:2]
[name: "Qal eh-ye"
 lat: 34.98300013
 lon: 63.13329964
 country: "Afghanistan"
 population: 2997
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }, name: "Chaghcharan"
 lat: 34.5167011
 lon: 65.25000063
 country: "Afghanistan"
 population: 15000
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }]

We have successfully transferred all cities from the dataframe to the Cities object, so now let’s store this object in a serialized file:

# Serialize the message to a byte string and write it to serialized_file
serialized_file.write(Cities.SerializeToString())
serialized_file.close()
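
To get a feel for what SerializeToString() writes, here is a hand-rolled sketch of the protobuf wire format for the scalar fields of one City_prop message. It is an illustration only, not the library’s implementation: the helper names are mine, it handles only non-negative integers, and it skips the repeated isocods field. The values come from the first row of the dataframe preview above.

```python
import struct


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_tag(field_number, wire_type):
    """A field key is (field_number << 3) | wire_type, stored as a varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_string(field_number, text):
    data = text.encode("utf-8")
    # wire type 2: length-delimited
    return encode_tag(field_number, 2) + encode_varint(len(data)) + data


def encode_float(field_number, value):
    # wire type 5: 32-bit little-endian
    return encode_tag(field_number, 5) + struct.pack("<f", value)


def encode_int32(field_number, value):
    # wire type 0: varint (non-negative values only in this sketch)
    return encode_tag(field_number, 0) + encode_varint(value)


city_bytes = (
    encode_string(1, "Qal eh-ye")      # name
    + encode_float(2, 34.983)          # lat
    + encode_float(3, 63.1333)         # lon
    + encode_string(4, "Afghanistan")  # country
    + encode_int32(5, 2997)            # population
)
print(len(city_bytes), city_bytes.hex())
```

Field names never appear in the output, only field numbers, which is a large part of why the format is so much more compact than XML.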

Cool, but there wouldn’t be much sense in storing data in a serialized file if we were not able to deserialize it back to a human-readable format. Let’s do it:

# Reading data from serialized_file
serialized_file = open('Cities_proto', "rb")
Cities_read = cities_pb2.Cities()
Cities_read.ParseFromString(serialized_file.read())
Cities_read.City[0:2]
[name: "Qal eh-ye"
 lat: 34.983001708984375
 lon: 63.13330078125
 country: "Afghanistan"
 population: 2997
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }, name: "Chaghcharan"
 lat: 34.516700744628906
 lon: 65.25
 country: "Afghanistan"
 population: 15000
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }]
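
Note that the coordinates read back are slightly different from the CSV values (34.983 became 34.983001708984375). That is expected: the schema declares lat and lon as float, which is a 32-bit type, while Python floats are 64-bit. The effect can be reproduced with the standard struct module alone; the helper name to_float32 is mine, and the inputs are the rounded values shown in the dataframe preview.

```python
import struct


def to_float32(value):
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


for original in (34.983, 65.250001):
    stored = to_float32(original)
    print(original, "->", stored, "difference:", abs(stored - original))
```

The difference is far below a metre on the ground, so float is fine for a city marker. If you need the full precision, declaring the fields as double in the .proto file would be the way to go.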

The final and best way to check that spatial data is not corrupted is to display it on a map (that is not true, but let’s do it for fun 😉), so I will create a simple function that displays a given city on a Folium map (if you don’t know how to do it, here is my post on that).

# Function that finds a city by name and displays it on a folium map
import folium
%matplotlib inline

def find_city(city_name):
    # city_map avoids shadowing the built-in map()
    city_map = folium.Map(tiles='Mapbox Bright')
    found = False
    for city in Cities_read.City:
        if city.name == city_name:
            folium.Marker([city.lat, city.lon],
                          popup=city.name +
                          "<br><i>Country: " + str(city.country) + "</i>" +
                          "<br><i>Population: " + str(city.population) + "</i>").add_to(city_map)
            city_map.location = [city.lat, city.lon]
            city_map.zoom_start = 5
            found = True
    if not found:
        print('City not found in our DB')
    return city_map
find_city('Krakow')

Krakow’s location looks good to me, and if you click the marker icon you should see info such as country and population (which also look fine).

serialized_file.close()
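
The explicit open() and close() calls used in this post can also be replaced with context managers, so the file is closed even if something fails halfway. Below is a minimal sketch; write_message and read_message are hypothetical helper names of mine, and they only assume that the object passed in has the SerializeToString() and ParseFromString() methods used earlier.

```python
def write_message(message, path):
    """Serialize a protobuf message and write it to path."""
    with open(path, "wb") as f:
        f.write(message.SerializeToString())


def read_message(message, path):
    """Fill an empty protobuf message from the bytes stored at path."""
    with open(path, "rb") as f:
        message.ParseFromString(f.read())
    return message


# Usage with the objects from this post (requires the generated cities_pb2):
# write_message(Cities, 'Cities_proto')
# Cities_read = read_message(cities_pb2.Cities(), 'Cities_proto')
```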

And that’s it! So now it’s time for the Avro format (nah, it’s rather time to read a book and stop looking at the monitor).
As always, you can download this code from my GitHub.