The passenger list of the Titanic is a popular dataset for machine learning, so I thought it was a fitting way to start this documentation of my AI experiments.
It’s a slightly morbid dataset, made a classic by the fact that it’s the first dataset most people use when starting out on Kaggle, the data science competition platform.
The idea is to use data science to predict whether a passenger survived or not, using the information we have about the passengers.
import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

data = pd.read_csv('titanic.csv')
data.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
The passenger list includes different bits of information about the passengers, such as their age, gender, which class they travelled in, how much they paid for their ticket, and so on. For more information about the data, see the Kaggle competition webpage.
Mrs. Cumings paid nearly ten times what Mr. Braund did for her fare and travelled in first class, while he travelled in third class, and she also survived, unlike him.
data.iloc[0:2].set_index('Name').T
| Name | Braund, Mr. Owen Harris | Cumings, Mrs. John Bradley (Florence Briggs Thayer) |
|---|---|---|
| PassengerId | 1 | 2 |
| Survived | 0 | 1 |
| Pclass | 3 | 1 |
| Sex | male | female |
| Age | 22.0 | 38.0 |
| SibSp | 1 | 1 |
| Parch | 0 | 0 |
| Ticket | A/5 21171 | PC 17599 |
| Fare | 7.25 | 71.2833 |
| Cabin | NaN | C85 |
| Embarked | S | C |
Select and prepare data
For the purpose of this exercise, we are only going to use the survival outcome, passenger class, sex, age, number of siblings/spouses aboard (SibSp) and number of parents/children aboard (Parch). We will ignore the fare, cabin number and what port they sailed from.
data = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']]
After a bit of data wrangling, we’ve changed the data we are working with into a more machine-friendly version, by normalizing the age (putting it on a 0-1 scale, with the oldest person on board being 1) and “one-hot encoding” the sex column.
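The wrangling code itself isn’t shown above, but it amounts to something like the sketch below, assuming the MinMaxScaler imported earlier is used to scale the age, missing ages are filled with the median, and pandas’ get_dummies handles the one-hot encoding:

# Sketch of the wrangling step (median fill, MinMaxScaler and get_dummies are assumptions)
data['Age'] = data['Age'].fillna(data['Age'].median())       # Age has missing values in the raw data
data[['Age']] = MinMaxScaler().fit_transform(data[['Age']])  # scale Age to the 0-1 range
data = pd.get_dummies(data, columns=['Sex'], dtype=int)      # one-hot encode Sex into Sex_female / Sex_male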
data.head()
|   | Survived | Pclass | Age | SibSp | Parch | Sex_female | Sex_male |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0.271174 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0.472229 | 1 | 0 | 1 | 0 |
| 2 | 1 | 3 | 0.321438 | 0 | 0 | 1 | 0 |
| 3 | 1 | 1 | 0.434531 | 1 | 0 | 1 | 0 |
| 4 | 0 | 3 | 0.434531 | 0 | 0 | 0 | 1 |
Split the dataset into training and testing
We need to split the data into a set we will use to train the neural network model and a set the model has never seen, to test how good it is. We also carve a validation set out of the training data to keep an eye on the model during training.
# Split the dataset into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Further split the training set into train and validation sets
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2
Creating a basic deep learning model with Keras is quite easy: we just create each layer of the model and add them in sequence. We choose dense layers, where each node is connected to every node in the next layer, with 20 nodes in each layer.
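The model code itself isn’t reproduced above, but a minimal sketch along those lines could look something like this. The number of hidden layers, the optimizer and the epoch count are assumptions rather than the original configuration, and the original network was larger (the Results section mentions just over 10k parameters), so treat it as an illustration:

# Minimal sketch of a Keras model for this data (layer count, optimizer and epochs are assumptions)
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Sex_female', 'Sex_male']

model = tf.keras.Sequential([
    tf.keras.Input(shape=(len(features),)),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(20, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # outputs the probability of survival
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

model.fit(train_df[features], train_df['Survived'],
          validation_data=(val_df[features], val_df['Survived']),
          epochs=50, verbose=0)

test_loss, test_acc = model.evaluate(test_df[features], test_df['Survived'])
print('Test Loss:', test_loss)
print('Test Accuracy:', test_acc)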
6/6 [==============================] - 0s 3ms/step - loss: 0.4432 - acc: 0.8156
Test Loss: 0.44318273663520813
Test Accuracy: 0.8156424760818481
The accuracy is higher than 0.8, which is a very good score on this particular challenge.
In “Titanic leaderboard: a score > 0.8 is great!”, Carl Ellis analyses how people are scoring in the competition, and his analysis shows that any score above 0.8 is ahead of the curve.
Results
It’s quite amazing how we can get such good results with a very basic deep learning model. There’s a lot that can be done to add finesse to the model, but the out-of-the-box standard one already gets very good results. The model has just over 10k parameters, which is not much compared to the billions of parameters used in large language models, so training it takes less than a second.