Almost everyone who starts their journey with data science begins with Kaggle's "Titanic" competition. It is the "Hello World" of ML model building, and I was no exception. For me it was also a kind of experimental station, especially for practising training-data preparation. Below you can find my code; you can also download it from my GitHub.
Competition Description: The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
Goal: It is your job to predict if a passenger survived the sinking of the Titanic or not. For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.
# import required modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import warnings
from IPython.display import display
warnings.filterwarnings("ignore")
plt.style.use('seaborn-white')
%matplotlib inline
#Read Titanic train data
train =pd.read_csv("./data/train.csv")
Data Dictionary:
survival: Survival (0 = No, 1 = Yes)
pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
sex: Sex
age: Age in years
sibsp: # of siblings / spouses aboard the Titanic
parch: # of parents / children aboard the Titanic
ticket: Ticket number
fare: Passenger fare
cabin: Cabin number
embarked: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
Variable notes:
pclass: a proxy for socio-economic status (SES); 1st = Upper, 2nd = Middle, 3rd = Lower
age: fractional if less than 1; if the age is estimated, it is in the form of xx.5
sibsp: the dataset defines family relations this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
parch: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch = 0 for them.
#Display 5 first rows of train dataframe
train.head()
#Check descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution
train.describe(include='all')
#Concise summary of a train DataFrame
train.info()
Usefulness of attributes:
1) PassengerId appears to be just a random identifier, so it has no value for our model and will be dropped from the training/test datasets.
2) The Cabin attribute has too many NaN values, so it will also be dropped from the training/test datasets (a quick missing-value check follows below).
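A quick way to confirm point 2 is to count missing values per column; this is just a minimal sanity check, assuming the train frame loaded above:
# Count missing values per column; in this train set Cabin is missing for 687 of 891 rows, Age for 177 and Embarked for 2
print(train.isnull().sum().sort_values(ascending=False).head())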
# Define function for data analysis: plot category frequency and survival rate for a given attribute
def plot_Survival_Rate_by_attr(a):
    display(train[['Survived', a]].groupby(a).mean())
    pclass_survived_ct = pd.crosstab(train[a], train['Survived'])
    pclass_survived_ct_div = pclass_survived_ct.div(pclass_survived_ct.sum(axis=1), axis=0)
    cat_count = Counter(train[a])
    df_cat = pd.DataFrame.from_dict(cat_count, orient='index')
    df_cat.sort_index(inplace=True)
    fig, ax = plt.subplots(1, 2, figsize=(18, 5))
    fig.subplots_adjust(wspace=0.2)
    pclass_survived_ct_div.plot(ax=ax[1], kind='bar',
                                stacked=True, color=['#ff3333', '#4dff4d'],
                                title='Survival Rate by ' + a)
    ax[1].legend(loc='best', bbox_to_anchor=(1, 1), title='Survived')
    ax[1].set_ylabel('Survival Rate')
    ax[1].set_xlabel(a)
    df_cat.plot(ax=ax[0], kind='bar', title=a + ' frequency')
    ax[0].legend_.remove()
plot_Survival_Rate_by_attr('Sex')
As we could expect from the 'age of gentlemen', women had a much higher chance of surviving. This may turn out to be our most important attribute for classification.
Let's convert this categorical variable to numeric; since it has only two values, we can keep it in a single numeric column called "Sex_flag".
#changing categorical values to numeric
train['Sex_flag'] = train['Sex'].map({"female":0,"male":1}).astype(int)
plot_Survival_Rate_by_attr('Pclass')
Again, not very surprising: more money = higher class = higher chance to survive. We will leave this attribute as it is.
plot_Survival_Rate_by_attr('Embarked')
#For missing Embarked values, we will find similar Fare values in the same Pclass and use those values
train.sort_values(['Fare','Pclass']).reset_index().iloc[813:819]
#Fill in NaN values for Embarked
train['Embarked']=train.sort_values(['Fare','Pclass'])['Embarked'].ffill()
#Convert categorical variable into dummy/indicator variables
train = pd.concat([train, pd.get_dummies(train['Embarked'])], axis=1);
train.head(5)
# display the Fare distribution split into 10 equal-width bins
plt.hist(train.Fare,bins=10);
train['Fare_bins']=pd.cut(train.Fare,bins=10)
plot_Survival_Rate_by_attr('Fare_bins')
# Let's try to normalize Fare (scale it to the 0-1 range); MinMaxScaler expects a 2D array, hence the reshape
minmax_scale = preprocessing.MinMaxScaler().fit(train['Fare'].values.reshape(-1, 1))
df_minmax = minmax_scale.transform(train['Fare'].values.reshape(-1, 1))
train['fare_norm'] = pd.DataFrame(df_minmax)
train.head(5)
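As a quick sanity check, fare_norm should match the plain min-max formula (x - min) / (max - min). A minimal sketch, assuming the train frame from above:
# Recompute the min-max scaling by hand and compare it with the scaler output
fare_manual = (train['Fare'] - train['Fare'].min()) / (train['Fare'].max() - train['Fare'].min())
print(np.allclose(fare_manual, train['fare_norm']))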
#For missing Age values, we will use the median over groups of the Sex and Pclass attributes
train.groupby(['Sex','Pclass'])[['Age']].median()
#Fill in NaN values for Age
train['Age']=train.groupby(['Sex','Pclass'])[['Age']].transform(lambda x: x.fillna(x.median()))
# Let's categorize Age by creating 6 age categories
bins=[0,8,16,25,40,60,100]
train['Age_bins']=pd.cut(train.Age,bins,labels=["Children","Teenagers","Youth","Adults","Middle_age","Older_People"])
plot_Survival_Rate_by_attr('Age_bins')
#Convert categorical variable into dummy/indicator variables
train = pd.concat([train, pd.get_dummies(train['Age_bins'])], axis=1);
plot_Survival_Rate_by_attr('SibSp')
plot_Survival_Rate_by_attr('Parch')
# Create one attribute from Parch and SibSp; together they describe the size of a passenger's family on board
train['Parch_SibSp'] =train.apply(lambda row: row['Parch']+row['SibSp'],axis=1)
plot_Survival_Rate_by_attr('Parch_SibSp')
# Let's flatten this category a little and reorder it so that its values better reflect the relation
# between family size and the chance of survival
def Parch_SibSp_bins(row):
    if pd.isnull(row['Parch_SibSp']):
        return row['Parch_SibSp']
    if row['Parch_SibSp'] == 0:
        return 1
    elif row['Parch_SibSp'] > 0 and row['Parch_SibSp'] < 4:
        return 0
    elif row['Parch_SibSp'] > 3 and row['Parch_SibSp'] < 7:
        return 2
    else:
        return 3
train['Parch_SibSp_bins'] =train.apply(lambda row: Parch_SibSp_bins(row),axis=1)
plot_Survival_Rate_by_attr('Parch_SibSp_bins')
# find the words that occur most often in Name; the top of the list should contain the titles
pd.Series([y for x in train.Name.values.flatten() for y in x.split()]).value_counts().head(10)
# Create a new attr. storing title
def Names_title(row):
    if row.Name.lower().find('mr.') > -1:
        return 1
    if row.Name.lower().find('miss.') > -1:
        return 2
    if row.Name.lower().find('mrs.') > -1:
        return 3
    if row.Name.lower().find('master.') > -1:
        return 4
    else:
        return 0
train['Names_title_flag'] =train.apply(lambda row: Names_title(row),axis=1)
plot_Survival_Rate_by_attr('Names_title_flag')
train.drop(['Fare_bins','Fare','Age','PassengerId', 'Name', 'Ticket', 'Cabin','Age_bins','Embarked','Sex', 'SibSp','Parch','Parch_SibSp'], axis=1, inplace=True)
train.head(5)
# Separate all attr. used for prediction from column that we are going to predict
train_x =train.iloc[:,1:]
train_y =train.iloc[:,0]
#Set aside 10% for final analysis
train_90, test_10 = train_test_split(train, test_size=0.1, random_state=0)
# Separate all attr. used for prediction from column that we are going to predict
train_90_x =train_90.iloc[:,1:]
train_90_y =train_90.iloc[:,0]
test_10_x =test_10.iloc[:,1:]
test_10_y =test_10.iloc[:,0]
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
from sklearn import metrics
def plot_model_result(model):
    # fit the model on the entire training data (used for the cross-validation score)
    model_all = model.fit(train_x, train_y)
    # fit the model on the 90% training split (used for holdout evaluation)
    model_hold = model.fit(train_90_x, train_90_y)
    # predict on the 10% holdout set
    results = model_hold.predict(test_10_x)
    fig, ax = plt.subplots(1, 2, figsize=(18, 5))
    cm = confusion_matrix(test_10_y, results)
    # recall = TP / (TP + FN)
    # precision = TP / (TP + FP)
    # F-beta score is the weighted harmonic mean of precision and recall
    pr, rec, fsc, supp = precision_recall_fscore_support(test_10_y, results)
    sns.heatmap(cm, annot=True, ax=ax[0])
    # labels, title and ticks (row/column 0 = not survived, 1 = survived)
    ax[0].set_xlabel('Predicted labels')
    ax[0].set_ylabel('True labels')
    ax[0].set_title('Confusion Matrix', fontsize=14)
    ax[0].xaxis.set_ticklabels(['not survived', 'survived'])
    ax[0].yaxis.set_ticklabels(['not survived', 'survived'])
    fpr, tpr, thresholds = metrics.roc_curve(test_10_y, results)
    # Plot the ROC curve
    ax[1].plot(fpr, tpr)
    ax[1].set_xlabel('False Positive Rate')
    ax[1].set_ylabel('True Positive Rate')
    ax[1].set_title('ROC curve', fontsize=14)
    # evaluate the model on the holdout set
    print('90/10 Holdout sets accuracy score: ' + str(round(accuracy_score(test_10_y, results), 2)))
    # evaluate the model with the cross-validation score
    print('Cross validation score: ' + str(round(cross_val_score(model_all, train_x, train_y).mean(), 2)))
    print('Area Under the Curve: ' + str(round(metrics.auc(fpr, tpr), 2)))
    print('Recall for survival: ' + str(round(rec[1], 2)))
    print('Precision for survival: ' + str(round(pr[1], 2)))
    print('F-beta score for survival: ' + str(round(fsc[1], 2)))
from sklearn.ensemble import RandomForestClassifier
# instantiate the Random Forest model class
rfc_model = RandomForestClassifier(n_estimators=100, random_state=0)
plot_model_result(rfc_model)
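Since the earlier plots suggested that Sex is the strongest predictor, it is worth inspecting the fitted forest's feature importances. A minimal sketch, assuming rfc_model was just fitted inside plot_model_result (its last fit call used the 90% training split):
# Rank features by importance from the fitted random forest
importances = pd.Series(rfc_model.feature_importances_, index=train_90_x.columns)
print(importances.sort_values(ascending=False))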
from sklearn.linear_model import LogisticRegression
# instantiate the Logistic Regression model class
lr_model = LogisticRegression(random_state=0)
plot_model_result(lr_model)
from sklearn.svm import SVC
# instantiate the Support Vector Classifier (SVC) model class
svc_model = SVC(random_state=0)
plot_model_result(svc_model)
from sklearn.neighbors import NearestCentroid
# instantiate the Nearest Centroid model class
nn_model = NearestCentroid()
plot_model_result(nn_model)
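To compare the four classifiers side by side, the same cross-validation score can be collected in one loop. A minimal sketch, assuming the model objects defined above:
# Compare mean cross-validation accuracy for all four models
models = {'RandomForest': rfc_model, 'LogisticRegression': lr_model,
          'SVC': svc_model, 'NearestCentroid': nn_model}
for name, m in models.items():
    score = cross_val_score(m, train_x, train_y).mean()
    print(name + ': ' + str(round(score, 2)))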
# Function for training data preparation
def train_data_prep(x):
    # Sex attr. preparation
    x['Sex_flag'] = x['Sex'].map({"female": 0, "male": 1}).astype(int)
    # Embarked attr. preparation
    x['Embarked'] = x.sort_values(['Fare', 'Pclass'])['Embarked'].ffill()
    x = pd.concat([x, pd.get_dummies(x['Embarked'])], axis=1)
    # Fare attr. preparation (MinMaxScaler expects a 2D array, hence the reshape)
    minmax_scale = preprocessing.MinMaxScaler().fit(x['Fare'].values.reshape(-1, 1))
    df_minmax = minmax_scale.transform(x['Fare'].values.reshape(-1, 1))
    x['fare_norm'] = pd.DataFrame(df_minmax)
    # Age attr. preparation
    x['Age'] = x.groupby(['Sex', 'Pclass'])['Age'].transform(lambda y: y.fillna(y.median()))
    bins = [0, 8, 16, 25, 40, 60, 100]
    x['Age_bins'] = pd.cut(x.Age, bins, labels=["Children", "Teenagers", "Youth", "Adults", "Middle_age", "Older_People"])
    x = pd.concat([x, pd.get_dummies(x['Age_bins'])], axis=1)
    # Parch & SibSp attr. preparation
    x['Parch_SibSp'] = x.apply(lambda row: row['Parch'] + row['SibSp'], axis=1)
    x['Parch_SibSp_bins'] = x.apply(lambda row: Parch_SibSp_bins(row), axis=1)
    # Name attr. preparation
    x['Names_title_flag'] = x.apply(lambda row: Names_title(row), axis=1)
    # Drop attributes that are no longer required
    x.drop(['Fare', 'Age', 'PassengerId', 'Name', 'Ticket', 'Cabin', 'Age_bins', 'Embarked', 'Sex', 'SibSp', 'Parch', 'Parch_SibSp'], axis=1, inplace=True)
    return x
# Function for test data preparation
def test_data_prep(x, train):
    # Concatenate test and train so that both get identical preprocessing
    x['ind'] = 'Test'
    train['ind'] = 'Train'
    frames = [x, train]
    all_data = pd.concat(frames).reset_index(drop=True)
    # Sex attr. preparation
    all_data['Sex_flag'] = all_data['Sex'].map({"female": 0, "male": 1}).astype(int)
    # Embarked attr. preparation
    all_data['Embarked'] = all_data.sort_values(['Fare', 'Pclass'])['Embarked'].ffill()
    all_data = pd.concat([all_data, pd.get_dummies(all_data['Embarked'])], axis=1)
    # Fare attr. preparation (MinMaxScaler expects a 2D array, hence the reshape)
    all_data['Fare'] = all_data.groupby('Pclass')['Fare'].transform(lambda y: y.fillna(y.median()))
    minmax_scale = preprocessing.MinMaxScaler().fit(all_data['Fare'].values.reshape(-1, 1))
    df_minmax = minmax_scale.transform(all_data['Fare'].values.reshape(-1, 1))
    all_data['fare_norm'] = pd.DataFrame(df_minmax)
    # Age attr. preparation
    all_data['Age'] = all_data.groupby(['Sex', 'Pclass'])['Age'].transform(lambda y: y.fillna(y.median()))
    bins = [0, 8, 16, 25, 40, 60, 100]
    all_data['Age_bins'] = pd.cut(all_data.Age, bins, labels=["Children", "Teenagers", "Youth", "Adults", "Middle_age", "Older_People"])
    all_data = pd.concat([all_data, pd.get_dummies(all_data['Age_bins'])], axis=1)
    # Parch & SibSp attr. preparation
    all_data['Parch_SibSp'] = all_data.apply(lambda row: row['Parch'] + row['SibSp'], axis=1)
    all_data['Parch_SibSp_bins'] = all_data.apply(lambda row: Parch_SibSp_bins(row), axis=1)
    # Name attr. preparation
    all_data['Names_title_flag'] = all_data.apply(lambda row: Names_title(row), axis=1)
    # Keep only the test rows and drop attributes that are no longer required
    all_data = all_data[all_data.ind == 'Test']
    all_data.drop(['Fare', 'ind', 'Age', 'PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin', 'Age_bins', 'Embarked', 'Sex', 'SibSp', 'Parch', 'Parch_SibSp'], axis=1, inplace=True)
    #x.set_index('PassengerId', inplace=True)
    return all_data
#Read Titanic train and test data
train = pd.read_csv("./data/train.csv")
test = pd.read_csv("./data/test.csv")
# Prepare the test set first: test_data_prep needs the raw, unprepared train columns for the combined preprocessing
X_test = test_data_prep(test, train)
train = train_data_prep(train)
X_test.info()
#Y_pred = svc_model.predict(X_test)
Y_pred = lr_model.predict(X_test)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": Y_pred
})
submission.to_csv('submission.csv', index=False)
My submission to the Kaggle competition scored position 3,236 out of 10,568 competition entries.