“Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.” That’s the definition of Protocol Buffers on the official web page. Alongside Apache Avro and Apache Thrift, it is one of the most popular data serialization frameworks used for big data containers. In this post, I will use my preferred programming language (yes... Python) to write (serialize) and read (deserialize) protobuf data.

We need to start by defining the protocol buffer message schema. Here you define the field types, whether each field is optional or required, and whether a field or nested message may be repeated. It’s worth spending some time on this definition, because this structure will dictate how you write, read and access the data in your code.

Below you can find my message schema. It was defined to store city information downloaded from the simplemaps web page.

cities.proto:

syntax = "proto2";

package gisdatascience;

message City_prop {
  required string name = 1;
  required float lat = 2;
  required float lon = 3;
  required string country = 4;
  required int32 population = 5;

  enum ISOcodeType {
    ISO2 = 0;
    ISO3 = 1;
  }

  message ISOcode {
    required string code = 1;
    optional ISOcodeType type = 2 [default = ISO2];
  }

  repeated ISOcode isocods = 6;
}

message Cities {
  repeated City_prop City = 1;
}
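The field numbers in the schema (`= 1`, `= 5`, ...) matter because they, not the field names, are what ends up in the serialized bytes. Below is a minimal sketch of how two of the fields above would be encoded on the wire, written with the standard library only. The helper names (`encode_varint`, `encode_varint_field`, `encode_string_field`) are made up for this illustration and are not part of the protobuf package; the real library does this work for you.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode a varint field (wire type 0), such as an int32."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a length-delimited field (wire type 2), such as a string."""
    payload = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


# `required int32 population = 5;` with the value 2997
population_bytes = encode_varint_field(5, 2997)
# `required string country = 4;` with the value "Afghanistan"
country_bytes = encode_string_field(4, "Afghanistan")

print(population_bytes.hex())
print(country_bytes.hex())
```

The population takes three bytes and the country takes its UTF-8 length plus two, with no field names stored anywhere, which is a large part of why protobuf payloads are smaller than XML.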

In order to work with this defined schema, you need to generate the classes allowing you to read and write Cities messages. To do this, you need to run the protocol buffer compiler protoc on your .proto, but first you need to install it:

Download the precompiled binary of the protocol buffer compiler (protoc), “protoc-3.4.0-win32.zip”, from https://github.com/google/protobuf/releases, unzip it, and add its location to your PATH environment variable.
Install the protobuf module (conda install protobuf).
Run the protoc compiler on the .proto file (protoc -I=source_dir --python_out=dest_dir source_dir/cities.proto).
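If you prefer to trigger the compilation from Python rather than from the command line, the last step can be sketched roughly as below. This is only an illustration: `build_protoc_command` is a hypothetical helper, and `source_dir` and `dest_dir` are the same placeholder directory names used in the command above. The compiler is only invoked when protoc is on the PATH and the .proto file exists.

```python
import shutil
import subprocess
from pathlib import Path


def build_protoc_command(source_dir: str, dest_dir: str, proto_file: str) -> list:
    """Build the protoc command line for generating Python classes."""
    return [
        "protoc",
        f"-I={source_dir}",
        f"--python_out={dest_dir}",
        (Path(source_dir) / proto_file).as_posix(),
    ]


command = build_protoc_command("source_dir", "dest_dir", "cities.proto")

if shutil.which("protoc") and Path("source_dir", "cities.proto").exists():
    # check=True raises CalledProcessError if protoc reports an error
    subprocess.run(command, check=True)
else:
    print("protoc or cities.proto not found; the command would be:")
    print(" ".join(command))
```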

As a result of this compilation you will get a cities_pb2.py file with the descriptors that let you work with the Cities class (you need to import this file into your program).

Alright, now we are ready to write Python code. Let’s start by loading our data into a pandas dataframe and then adding it row by row to a Cities object.

import cities_pb2
import pandas as pd

csv_file =pd.read_csv('simplemaps-worldcities-basic.csv', sep=',', header=0)

csv_file.head()

	city	city_ascii	lat	lng	pop	country	iso2	iso3	province
0	Qal eh-ye Now	Qal eh-ye	34.983000	63.133300	2997.0	Afghanistan	AF	AFG	Badghis
1	Chaghcharan	Chaghcharan	34.516701	65.250001	15000.0	Afghanistan	AF	AFG	Ghor
2	Lashkar Gah	Lashkar Gah	31.582998	64.360000	201546.0	Afghanistan	AF	AFG	Hilmand
3	Zaranj	Zaranj	31.112001	61.886998	49851.0	Afghanistan	AF	AFG	Nimroz
4	Tarin Kowt	Tarin Kowt	32.633298	65.866699	10000.0	Afghanistan	AF	AFG	Uruzgan

# Open or create serialized file containing list of cities
serialized_file = open('Cities_proto', "wb")

# Create empty Cities object
Cities = cities_pb2.Cities()

# Iterate through the csv_file dataframe and add each row to the Cities object
for _, row in csv_file.iterrows():
    new_city = Cities.City.add()
    new_city.name = row['city_ascii']
    new_city.lat = row['lat']
    new_city.lon = row['lng']
    new_city.country = row['country']
    new_city.population = int(row['pop'])
    # Every city gets two ISO codes: the 2-letter and the 3-letter one
    iso = new_city.isocods.add()
    iso.code = str(row['iso2'])
    iso.type = cities_pb2.City_prop.ISO2
    iso = new_city.isocods.add()
    iso.code = str(row['iso3'])
    iso.type = cities_pb2.City_prop.ISO3
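One thing to watch in the loop above: `population` is a required int32, while pandas reads the `pop` column as floats, and calling `int()` on a missing value (NaN) raises a ValueError. Whether the simplemaps file actually contains missing populations is not checked here, so treat the following as a defensive sketch only. `clean_population` is a hypothetical helper, and falling back to 0 is just one possible choice.

```python
import math


def clean_population(raw_value) -> int:
    """Convert a CSV population value to an int, using 0 for missing values."""
    if raw_value is None:
        return 0
    value = float(raw_value)
    if math.isnan(value):
        return 0  # assumption: 0 stands in for "population unknown"
    return int(value)


print(clean_population(2997.0))
print(clean_population(float("nan")))
```

In the loop it would be used as `new_city.population = clean_population(row['pop'])`. If a silent default is not acceptable for your data, skipping the row or raising an error may be the better choice.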

# Display 2 cities from the Cities object
Cities.City[0:2]

[name: "Qal eh-ye"
 lat: 34.98300013
 lon: 63.13329964
 country: "Afghanistan"
 population: 2997
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }, name: "Chaghcharan"
 lat: 34.5167011
 lon: 65.25000063
 country: "Afghanistan"
 population: 15000
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }]

We have successfully transferred all cities from the dataframe to the Cities object, so now let’s store this object in a serialized file:

# Serialize the message to a byte string and write it to serialized_file
serialized_file.write(Cities.SerializeToString())
serialized_file.close()

Cool, but there wouldn’t be much sense in storing data in a serialized file if we were not able to deserialize it back to a human-readable format. Let’s do it...

# Reading data from serialized_file
serialized_file= open('Cities_proto', "rb")
Cities_read = cities_pb2.Cities()
Cities_read.ParseFromString(serialized_file.read())

Cities_read.City[0:2]

[name: "Qal eh-ye"
 lat: 34.983001708984375
 lon: 63.13330078125
 country: "Afghanistan"
 population: 2997
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }, name: "Chaghcharan"
 lat: 34.516700744628906
 lon: 65.25
 country: "Afghanistan"
 population: 15000
 isocods {
   code: "AF"
   type: ISO2
 }
 isocods {
   code: "AFG"
   type: ISO3
 }]
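You may have noticed that lat and lon look slightly different after reading them back (for example 34.983001708984375 instead of 34.98300013). The data is not corrupted: the schema declares both fields as `float`, which is a 32-bit single-precision number, so the values are rounded to the nearest value that type can represent. A small sketch with the standard `struct` module reproduces the effect; `as_float32` is a name made up for this example.

```python
import struct


def as_float32(value: float) -> float:
    """Round-trip a Python float through a 32-bit IEEE 754 float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Values as displayed before serialization
for original in (34.98300013, 63.13329964, 65.25000063):
    print(original, "->", as_float32(original))
```

If you need more precision for coordinates, declaring the fields as `double` in the .proto file would keep the full 64-bit values, at the cost of 4 extra bytes per field.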

The final and best way of checking that spatial data is not corrupted is to display it on a map (that is not true, but let’s do it for fun 😉), so I will create a simple function that displays a given city on a Folium map (if you don’t know how to do it, here is my post on that).

# Function finding a city by name and displaying it on a folium map
import folium
%matplotlib inline

def find_city(city_name):
    city_map = folium.Map(tiles='Mapbox Bright')
    found = False
    for city in Cities_read.City:
        if city.name == city_name:
            popup_html = (city.name +
                          "<br><i>Country: " + str(city.country) + "</i>"
                          "<br><i>Population: " + str(city.population) + "</i>")
            folium.Marker([city.lat, city.lon], popup=popup_html).add_to(city_map)
            # Center the map on the city that was found
            city_map.location = [city.lat, city.lon]
            city_map.zoom_start = 5
            found = True
    if not found:
        print('City not found in our DB')
    return city_map

find_city('Krakow')

Krakow’s location looks good to me, and if you click the marker icon you should see info such as the country and population (which also look fine).


serialized_file.close()

And that’s it! Now it’s time for the Avro format (nah, it’s rather time to read a book and stop looking at the monitor...).
As always, you can download this code from my GitHub.