Apache Spark basics

In the era of Big Data and containers, you can hardly avoid working with cluster-computing frameworks like Hadoop or Spark. At some point I had to choose between the two, and since Apache Spark seemed more flexible and faster, I decided to take a closer look at it. Following …

Image recognition basics

There are multiple image classification datasets available online or bundled with Python ML modules, and this notebook contains just some sample code for image classification on those publicly available datasets. In this post, I will use a very ‘blond’ solution, and definitely not a perfect one (deep neural …

Protocol Buffers

“…Protocol buffers are Google’s language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler. You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from …

My Kaggle Titanic competition submission

Almost everyone who starts a journey in data science starts from Kaggle’s “Titanic” competition, the “Hello World” of ML model building, and so did I. For me it was a kind of experimental station… especially for training data preparation. Below you can find my code. You can also …

Spatial data visualization in Python

Although it is much more convenient to use dedicated GIS software, like ArcGIS or QGIS, for spatial data visualization, the ability to display spatial data within your code (especially if you are working with notebooks) can be very handy. Currently there are dozens of geospatial Python libraries, and here …