Predicting outcomes for passengers on the Titanic with deep learning

The Titanic's passenger list is a popular dataset for machine learning, so I thought it was a fitting way to start this documentation of my AI experiments.

It’s a slightly morbid dataset, made a classic by the fact that it’s typically the first one anyone starting out on Kaggle, the data science competition platform, gets to work with.

The idea is to use data science to predict whether a passenger survived or not, using the information we have about each passenger.

Open this notebook in Colab to try for yourself.

Import and explore

First, we import pandas (plus TensorFlow and a couple of scikit-learn helpers we will need later) and explore the data.

import pandas as pd
import tensorflow as tf
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

data = pd.read_csv('titanic.csv')
data.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

The passenger list includes different bits of information about the passengers, such as their age, gender, which class they travelled in, how much they paid for their ticket, and so on. For more information about the data, see the Kaggle competition webpage.

If we look at the first two rows in the data, we see Mr. Owen Harris Braund and Mrs. John Bradley (Florence Briggs Thayer).

Mrs. Cumings paid nearly ten times what Mr. Braund did for her fare and travelled in first class while he travelled in third; she also survived, unlike Mr. Braund.

Mr. Braund

Mrs. Cumings
data.iloc[0:2].set_index('Name').T
Name Braund, Mr. Owen Harris Cumings, Mrs. John Bradley (Florence Briggs Thayer)
PassengerId 1 2
Survived 0 1
Pclass 3 1
Sex male female
Age 22.0 38.0
SibSp 1 1
Parch 0 0
Ticket A/5 21171 PC 17599
Fare 7.25 71.2833
Cabin NaN C85
Embarked S C

Select and prepare data

For the purpose of this exercise, we are only going to use survival, passenger class, sex, age, number of siblings/spouses on board (SibSp) and number of parents/children on board (Parch). We will ignore the fare, cabin number and the port the passengers embarked from.

data = data[['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch']].copy()
# Impute missing ages with the median age
data['Age'] = data['Age'].fillna(data['Age'].median())
# Scale/normalize age to the 0-1 range
scaler = MinMaxScaler()
data[['Age']] = scaler.fit_transform(data[['Age']])

# one-hot encode sex
data = pd.get_dummies(data, columns=['Sex'])

After a bit of data wrangling, we’ve turned the data into a more machine-friendly version by normalizing the age (putting it on a 0–1 scale, with the oldest person on board being 1) and “one-hot encoding” the gender.

data.head()
Survived Pclass Age SibSp Parch Sex_female Sex_male
0 0 3 0.271174 1 0 0 1
1 1 1 0.472229 1 0 1 0
2 1 3 0.321438 0 0 1 0
3 1 1 0.434531 1 0 1 0
4 0 3 0.434531 0 0 0 1
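This min-max scaling is nothing more than shifting and rescaling the column. A minimal sketch of the equivalent computation done by hand, re-reading the raw CSV so we don’t disturb the data we just prepared; it reproduces the Age column shown above:

raw = pd.read_csv('titanic.csv')
age = raw['Age'].fillna(raw['Age'].median())
# 0 for the youngest passenger on board, 1 for the oldest
print(((age - age.min()) / (age.max() - age.min())).head())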

Split the dataset into training and testing

We need to split the data into a set we will use to train the neural network model and a held-out set that the model has never seen, so we can test how well it generalizes.

# Split the dataset into train and test sets
train_df, test_df = train_test_split(data, test_size=0.2, random_state=42)

# Further split the training set into train and validation sets
# (the validation set is set aside and not used further in this simple example)
train_df, val_df = train_test_split(train_df, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df['Survived']

X_test = test_df.drop("Survived", axis=1)
Y_test = test_df['Survived']
print(X_train.shape)
print(Y_train.shape)
(534, 6)
(534,)
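As a point of reference before training: roughly 38% of the passengers in this dataset survived, so a model that always predicts “did not survive” would already be right about 62% of the time. A quick sanity check:

# Survival rate in the training split; a majority-class baseline
# would score roughly 1 minus this value.
print(Y_train.mean())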

Create a deep learning model and train it

Creating a basic deep learning model with Keras is quite easy: we just create each layer of the model and add them in sequence. We choose dense layers, where each node is connected to every node in the next layer, with 100 nodes in each of the two hidden layers.

from keras.layers import Dense
from keras.models import Sequential

model = Sequential()

# two hidden layers of 100 nodes each, taking the 6 input features
model.add(Dense(units=100, input_shape=(6,), activation='relu'))
model.add(Dense(units=100, activation='relu'))
# a single sigmoid output node: the predicted probability of survival
model.add(Dense(units=1, activation='sigmoid'))

model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['acc']
)
model.fit(X_train, Y_train, verbose=2, epochs=20)
Epoch 1/20
17/17 - 1s - loss: 0.6533 - acc: 0.5749 - 792ms/epoch - 47ms/step
Epoch 2/20
17/17 - 0s - loss: 0.5574 - acc: 0.7360 - 33ms/epoch - 2ms/step
Epoch 3/20
17/17 - 0s - loss: 0.5080 - acc: 0.7884 - 34ms/epoch - 2ms/step
Epoch 4/20
17/17 - 0s - loss: 0.4818 - acc: 0.7921 - 30ms/epoch - 2ms/step
Epoch 5/20
17/17 - 0s - loss: 0.4713 - acc: 0.7959 - 35ms/epoch - 2ms/step
Epoch 6/20
17/17 - 0s - loss: 0.4676 - acc: 0.7884 - 32ms/epoch - 2ms/step
Epoch 7/20
17/17 - 0s - loss: 0.4585 - acc: 0.8052 - 35ms/epoch - 2ms/step
Epoch 8/20
17/17 - 0s - loss: 0.4515 - acc: 0.7996 - 33ms/epoch - 2ms/step
Epoch 9/20
17/17 - 0s - loss: 0.4430 - acc: 0.8090 - 34ms/epoch - 2ms/step
Epoch 10/20
17/17 - 0s - loss: 0.4421 - acc: 0.8071 - 33ms/epoch - 2ms/step
Epoch 11/20
17/17 - 0s - loss: 0.4354 - acc: 0.8090 - 35ms/epoch - 2ms/step
Epoch 12/20
17/17 - 0s - loss: 0.4357 - acc: 0.8071 - 36ms/epoch - 2ms/step
Epoch 13/20
17/17 - 0s - loss: 0.4325 - acc: 0.8109 - 33ms/epoch - 2ms/step
Epoch 14/20
17/17 - 0s - loss: 0.4328 - acc: 0.7959 - 35ms/epoch - 2ms/step
Epoch 15/20
17/17 - 0s - loss: 0.4399 - acc: 0.7959 - 32ms/epoch - 2ms/step
Epoch 16/20
17/17 - 0s - loss: 0.4255 - acc: 0.8034 - 35ms/epoch - 2ms/step
Epoch 17/20
17/17 - 0s - loss: 0.4220 - acc: 0.8127 - 35ms/epoch - 2ms/step
Epoch 18/20
17/17 - 0s - loss: 0.4217 - acc: 0.8109 - 32ms/epoch - 2ms/step
Epoch 19/20
17/17 - 0s - loss: 0.4212 - acc: 0.8090 - 35ms/epoch - 2ms/step
Epoch 20/20
17/17 - 0s - loss: 0.4172 - acc: 0.8184 - 32ms/epoch - 2ms/step
<keras.src.callbacks.History at 0x7e0516b54f40>
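If the model.fit call above is assigned to a variable, e.g. history = model.fit(...), the returned History object lets us plot the loss per epoch and check that it has plateaued. A minimal sketch, assuming matplotlib is available:

import matplotlib.pyplot as plt

# assumes the call above was written as: history = model.fit(X_train, Y_train, verbose=2, epochs=20)
plt.plot(history.history['loss'])
plt.xlabel('epoch')
plt.ylabel('training loss')
plt.show()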

Test the model on data it hasn’t seen

Now that we’ve trained the model and the loss has more or less stopped improving, we can evaluate it on the test set it has never seen.

loss, accuracy = model.evaluate(X_test, Y_test)
print(f"Test Loss: {loss}")
print(f"Test Accuracy: {accuracy}")
6/6 [==============================] - 0s 3ms/step - loss: 0.4432 - acc: 0.8156
Test Loss: 0.44318273663520813
Test Accuracy: 0.8156424760818481

The accuracy is higher than 0.8, which is a very good score on this particular challenge.

In “Titanic leaderboard: a score > 0.8 is great!”, Carl Ellis analyses how people are scoring in the competition; his chart shows that any score above 0.8 is ahead of the curve.

Score distribution from Kaggle competition
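We can also ask the trained model for individual predictions. A minimal sketch with a hypothetical passenger (a 30-year-old woman travelling alone in first class); the age has to go through the same scaler we fitted earlier, and the columns must match the order of X_train:

# hypothetical passenger: 1st class, age 30, no relatives on board, female
age_scaled = scaler.transform(pd.DataFrame({'Age': [30.0]}))[0][0]
passenger = pd.DataFrame(
    [[1, age_scaled, 0, 0, 1, 0]],
    columns=['Pclass', 'Age', 'SibSp', 'Parch', 'Sex_female', 'Sex_male']
)
print(model.predict(passenger))  # predicted probability of survival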

Results

It’s quite amazing how good the results are from such a basic deep learning model. There’s a lot that could be done to add finesse, but the out-of-the-box standard model already performs very well. It has just over 10k parameters, which is nothing compared to the billions of parameters used in large language models, so training it for 20 epochs takes only a second or two.
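The parameter count is easy to verify with model.summary(): 6 × 100 + 100 = 700 weights and biases in the first layer, 100 × 100 + 100 = 10,100 in the second, and 100 × 1 + 1 = 101 in the output layer, for 10,901 trainable parameters in total.

# prints the layer-by-layer parameter counts (700 + 10,100 + 101 = 10,901)
model.summary()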