EDA on the Titanic dataset with RStudio

What factors might determine the survival of passengers?

Jan 29, 2021

Titanic (1997) is a well-known romantic disaster movie based on the historical sinking of the RMS Titanic in the North Atlantic Ocean in 1912. The movie has many memorable scenes, but my favorite might be Jack’s death scene. In this scene, Rose sees the light from the lifeboat and tries to wake Jack, but he is already dead. Jack is not the only one; many other passengers die as well. Rose, however, survives the sinking. It seems that some factors helped Rose and other passengers survive the disaster while others died.

**The sinking of the Titanic (1912)** Reference: https://en.wikipedia.org/wiki/Sinking_of_the_Titanic

To find the answer, we need to know what kind of passengers were more likely to survive. Using the Titanic dataset available on Kaggle, I perform an exploratory data analysis (EDA) on the training dataset to answer this question.

Let’s explore the Titanic dataset

The training dataset consists of the following information about each passenger:

PassengerId
Survived: 0 = No, 1 = Yes
Pclass — Ticket class: 1st (upper), 2nd(middle), 3rd(lower)
Name
Sex: male, female
Age
SibSp — The number of siblings / spouses aboard the Titanic
Parch — The number of parents / children aboard the Titanic
Ticket — Ticket number
Fare — Passenger fare
Cabin — Cabin number
Embarked — Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

After taking a quick look, I see 4 variables (“PassengerId”, “Name”, “Ticket”, “Cabin”) that might not help much in answering the question. Therefore, I choose the remaining 8 variables for further analysis.
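
As a quick check of this selection, the sketch below uses only the column names listed above and base R. The variable names `all_columns`, `dropped`, and `kept` are mine, chosen for illustration.

```r
# The 12 columns of the training dataset, as listed above
all_columns <- c("PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
                 "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked")

# The 4 columns set aside because they are unlikely to help
dropped <- c("PassengerId", "Name", "Ticket", "Cabin")

# Everything that is left is kept for the analysis
kept <- setdiff(all_columns, dropped)
print(kept)
```

The same idea works on the real data frame with `train[, kept]` once `train` has been loaded.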

Performing EDA with RStudio

Downloading the training dataset (“train.csv”) from Kaggle

https://www.kaggle.com/c/titanic/data

Installing the relevant packages and loading libraries

First, I need to install the relevant R packages for data manipulation and visualization. Then, I load these packages with library().

# install packages (only needed once)
install.packages("ggplot2")
install.packages("dplyr")
install.packages("gtsummary")
install.packages("gt")
install.packages("webshot")
# PhantomJS is used by webshot when the table is saved as an image
webshot::install_phantomjs()

# load libraries
library(ggplot2)
library(dplyr)
library(gtsummary)
library(gt)
library(webshot)

Data cleaning

Before performing EDA, I want to see the structure of the dataset and check whether there are any missing values in the dataset.

# read "train.csv" file and avoid converting strings to factors
train <- read.csv("train.csv", stringsAsFactors = FALSE)

# take a look at the structure of the dataset
str(train)

# summarize missing values in each column
colSums(is.na(train))                 # values stored as NA
colSums(train == "", na.rm = TRUE)    # values stored as empty strings

Output:

After checking the dataset, I see missing values in the Age, Cabin, and Embarked columns, but they take different forms: “NA” for Age and “empty string” for Cabin and Embarked. Thus, I want to convert “empty string” into “NA” for easier manipulation.
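
To illustrate why the two forms matter, here is a minimal sketch on a made-up three-row data frame (`toy`, not the Kaggle file), using base R only. `is.na()` alone does not see empty strings, so a small helper that checks both gives one count per column. The helper name `count_missing` is hypothetical.

```r
# Toy data frame standing in for train.csv (values are invented for illustration)
toy <- data.frame(
  Age      = c(22, NA, 35),
  Cabin    = c("", "C85", ""),
  Embarked = c("S", "", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count values that are either NA or an empty string
count_missing <- function(df) {
  sapply(df, function(col) sum(is.na(col) | col == ""))
}

na_only <- colSums(is.na(toy))   # sees only the NA in Age
both    <- count_missing(toy)    # also sees the empty strings
print(na_only)
print(both)
```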

# replace empty strings with NA in the character columns
train <- train %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# check missing values again
colSums(is.na(train))

Output:

Creating the descriptive statistics table

To give an overall picture of the dataset, I create a descriptive statistics table of the 8 survival-related variables that I chose at the beginning.

# select variables for creating the table and replace old values with new ones for easier understanding
sum <- train %>%
  select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare, Embarked) %>%
  mutate(
    Pclass = case_when(Pclass == "1" ~ "1st",
                       Pclass == "2" ~ "2nd",
                       Pclass == "3" ~ "3rd"),
    Embarked = case_when(Embarked == "C" ~ "Cherbourg",
                         Embarked == "Q" ~ "Queenstown",
                         Embarked == "S" ~ "Southampton")
    )

# create the table
t1 <- sum %>% 
  tbl_summary(
    by = Survived,
    missing = "no"
    ) %>%
  add_n() %>%
  modify_header(update = list(
    label ~ "**Variables**",
    stat_1 ~ "**No** (N = 549)",
    stat_2 ~ "**Yes** (N = 342)")) %>%
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Survived**") %>%
  bold_labels()

# save the table as image
t1 %>%
  as_gt() %>%
  gt::gtsave(
    filename = "Table1.png"
  )

Output:

**Table 1:** The *descriptive statistics table of the Titanic data*

Table 1 summarizes each variable of the Titanic dataset after missing values are excluded. Of the 891 passengers in the training dataset (N), 549 did not survive and 342 survived. Each variable is split into 2 groups: passengers who did not survive and passengers who survived. For the categorical variables (Pclass, Sex, SibSp, Parch, and Embarked), the table shows the number of passengers in each group together with the percentage. For the numerical variables (Age and Fare), it shows the median with the interquartile range (IQR) in each group.
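
For readers who want to see what these summaries mean, here is a minimal base R sketch on an invented six-row data frame (`toy`). It is not how gtsummary computes the table internally; it only reproduces the same kinds of statistics: percentages within each Survived group for a categorical variable, and the median with the first and third quartiles for a numerical one.

```r
# Toy data (invented values) standing in for the Survived, Sex, and Age columns
toy <- data.frame(
  Survived = c(0, 0, 0, 1, 1, 1),
  Sex      = c("male", "male", "female", "female", "female", "male"),
  Age      = c(20, 30, 40, 10, 20, 50)
)

# Categorical variable: percentage of each sex within each Survived group
sex_pct <- round(100 * prop.table(table(toy$Sex, toy$Survived), margin = 2), 1)
print(sex_pct)

# Numerical variable: median and quartiles of Age within each Survived group
age_median <- tapply(toy$Age, toy$Survived, median)
age_q1 <- tapply(toy$Age, toy$Survived, quantile, probs = 0.25, names = FALSE)
age_q3 <- tapply(toy$Age, toy$Survived, quantile, probs = 0.75, names = FALSE)
print(rbind(median = age_median, q1 = age_q1, q3 = age_q3))
```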

Creating the graph for each variable

Survived: the number of not-survived and survived passengers

# count passengers in each Survived group
survival <- train %>%
  count(Survived)

ggplot(survival, aes(x = as.factor(Survived), y = n, fill = as.factor(Survived))) +
  geom_bar(stat = "identity", width = 0.5) + 
  geom_text(aes(label = n), vjust = -0.5) +
  theme(legend.position = "none") +
  xlab("Survived") +
  ylab("Count (n)") +
  scale_fill_manual(values = c("#FF6666", "#00B492")) +
  scale_x_discrete(labels = c("0" = "No", "1" = "Yes"))

Output:

**Figure 1:** The bar plot shows the number of passengers who did not survive and who survived

Figure 1 shows how many passengers survived the sinking of the Titanic. In the bar plot, 549 passengers did not survive (red) whereas 342 passengers survived (green). Most of the passengers did not survive the disaster.
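
The two counts can also be expressed as shares of the 891 passengers. The short sketch below uses only the numbers reported above; the variable names are mine.

```r
# Counts reported in Figure 1
survival_counts <- c(No = 549, Yes = 342)

# Share of each outcome among all passengers in the training dataset
survival_share <- round(prop.table(survival_counts), 3)
print(survival_share)
```

In other words, roughly 38% of the passengers in the training dataset survived.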

Pclass vs Survived:

# count passengers by ticket class and survival
ticket <- train %>%
  group_by(Pclass) %>%
  count(Survived)

ggplot(ticket, aes(x = as.factor(Pclass), y = n, fill = as.factor(Survived))) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, 
            position = position_dodge(width = 0.9)) +
  xlab("Pclass") +
  ylab("Count (n)") +
  labs(fill = "Survived") +
  scale_fill_manual(values = c("#FF6666", "#00B492"),
                    labels = c("No", "Yes")) +
  scale_x_discrete(labels = c("1" = "1st", "2" = "2nd", "3" = "3rd"))

Output:

**Figure 2:** The bar plot shows the number of passengers who did not survive and who survived in each ticket class

Figure 2 shows the number of passengers who did not survive and who survived in the upper class (1st), the middle class (2nd), and the lower class (3rd). The number of survivors is highest in the first class. In contrast, the number of passengers who did not survive is highest in the third class.

Sex vs Survived:

# count passengers by sex and survival
gender <- train %>%
  group_by(Sex) %>%
  count(Survived)

ggplot(gender, aes(x = Sex, y = n, fill = as.factor(Survived))) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5,
            position = position_dodge(width = 0.9)) +
  ylab("Count (n)") +
  labs(fill = "Survived") +
  scale_fill_manual(values = c("#FF6666", "#00B492"),
                    labels = c("No", "Yes")) +
  scale_x_discrete(labels = c("female" = "Female", "male" = "Male"))

Output:

**Figure 3:** The bar plot shows the number of passengers who did not survive and who survived for each sex

Figure 3 shows the number of female and male passengers who did not survive and who survived. Most of the survivors are female whereas most of the passengers who did not survive are male.

Age vs Survived:

# remove NA in Age column
life <- train %>%
  filter(!is.na(Age))

# multi density chart
ggplot(life, aes(x = Age)) +
  geom_density(aes(fill = as.factor(Survived)), alpha = 0.5) +
  scale_fill_manual(values = c("#FF6666", "#00B492"),
                    labels = c("No", "Yes")) +
  theme(legend.position = c(0.85, 0.55),
        legend.background = element_rect(fill = "white", 
                                         color = "black")) +
  xlab("Age (Years)") +
  ylab("Density") +
  labs(fill = "Survived")

Output:

**Figure 4:** The multi-density chart shows the age distribution of passengers who did not survive and who survived

Figure 4 compares the age distribution of passengers who survived with that of passengers who did not. The median age of both groups is 28 years (Table 1). For passengers younger than 15 years, the density of survivors is higher than that of non-survivors, which might indicate that children were more likely to survive the disaster. For passengers between 15 and 30 years and those older than 60 years, the density of survivors is lower than that of non-survivors.
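
One way to put numbers on the age ranges discussed here is to cut Age into bands and compute the survival rate in each band. The sketch below does this in base R on an invented eight-row data frame (`toy`); the band boundaries mirror the ages mentioned in the text, and the column name `AgeBand` is my own.

```r
# Toy data (invented values) standing in for the Age and Survived columns
toy <- data.frame(
  Age      = c(4, 9, 14, 22, 25, 28, 40, 65),
  Survived = c(1, 1, 0, 0, 0, 1, 1, 0)
)

# Cut Age into bands; right = FALSE makes each band include its lower bound
toy$AgeBand <- cut(toy$Age,
                   breaks = c(0, 15, 30, 60, Inf),
                   labels = c("<15", "15-29", "30-59", "60+"),
                   right = FALSE)

# Survival rate per band = mean of the 0/1 Survived column
band_rate <- tapply(toy$Survived, toy$AgeBand, mean)
print(round(band_rate, 2))
```

On the real data, rows with a missing Age would need to be removed first, as in the plotting code.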

SibSp vs Survived

# count passengers by number of siblings/spouses and survival
siblings <- train %>%
  group_by(SibSp) %>%
  count(Survived)

ggplot(siblings, aes(x = as.factor(SibSp), y = n, fill = as.factor(Survived))) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, 
            position = position_dodge(width = 1)) +
  xlab("SibSp (n)") +
  ylab("Count (n)") +
  labs(fill = "Survived") +
  scale_fill_manual(values = c("#FF6666", "#00B492"),
                    labels = c("No", "Yes"))

Output:

**Figure 5:** The bar chart shows the number of passengers who did not survive and who survived for each number of siblings/spouses aboard

Figure 5 shows that the number of survivors tends to decrease as the number of siblings/spouses increases. In addition, the number of survivors is highest among passengers without any siblings/spouses. It might indicate that passengers without siblings/spouses are more likely to survive than passengers with siblings/spouses. However, raw counts also reflect how many passengers are in each group, so the survival rate within each group should be compared before drawing a firm conclusion.
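
Because a bar of counts grows with the size of the group, a group can have the most survivors and still have a lower survival rate. The base R sketch below shows this on an invented data frame (`toy`); the numbers are made up to illustrate the point and are not taken from the Titanic data.

```r
# Toy data (invented values): 7 passengers with SibSp = 0, 3 with 1, 1 with 2
toy <- data.frame(
  SibSp    = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2),
  Survived = c(1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0)
)

# Number of survivors and group size for each SibSp value
survived_n <- tapply(toy$Survived, toy$SibSp, sum)
group_n    <- tapply(toy$Survived, toy$SibSp, length)

# Survival rate within each group
rate <- survived_n / group_n
print(rbind(survived = survived_n, total = group_n, rate = round(rate, 2)))
```

Here the group with no siblings/spouses has the most survivors (3) but a lower rate than the group with one. Running the same calculation on `train` would show whether the pattern in the counts holds for the rates.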

Parch vs Survived:

# count passengers by number of parents/children and survival
family <- train %>%
  group_by(Parch) %>%
  count(Survived)

ggplot(family, aes(x = as.factor(Parch), y = n, fill = as.factor(Survived))) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, 
            position = position_dodge(width = 1)) +
  xlab("Parch (n)") +
  ylab("Count (n)") +
  labs(fill = "Survived") +
  scale_fill_manual(values = c("#FF6666", "#00B492"),
                    labels = c("No", "Yes"))

Output:

**Figure 6:** The bar chart shows the number of passengers who did not survive and who survived for each number of parents/children aboard

Figure 6 shows that the number of survivors tends to decrease as the number of parents/children increases. It might indicate that passengers without any parents/children are more likely to survive, which is similar to the SibSp result. As with SibSp, these are raw counts, so they also depend on the size of each group.

Fare vs Survived:

# multi density chart of Fare by survival
ggplot(train, aes(x = Fare)) +
  geom_density(aes(fill = as.factor(Survived)), alpha = 0.5) +
  scale_fill_manual(values = c("#FF6666", "#00B492"),
                    labels = c("No", "Yes")) +
  theme(legend.position = c(0.85, 0.55),
        legend.background = element_rect(fill = "white", color = "black")) +
  xlab("Fare") +
  ylab("Density") +
  labs(fill = "Survived")

Output:

**Figure 7:** The multi-density chart shows the fare distribution of passengers who did not survive and who survived

Figure 7 compares the fare distribution of passengers who survived with that of passengers who did not. The density of survivors is higher than that of non-survivors when the fare is high, whereas it is lower when the fare is low. It might indicate that passengers who paid a high fare are more likely to survive than passengers who paid a low fare. This result corresponds to the Pclass result.
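
The link between Fare and Pclass can be checked by comparing the typical fare in each ticket class. The base R sketch below does this on an invented data frame (`toy`); the fares are made-up values chosen only to show the calculation.

```r
# Toy data (invented values) standing in for the Pclass and Fare columns
toy <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Fare   = c(80, 50, 30, 26, 13, 10.5, 8, 7.9, 7.2)
)

# Median fare within each ticket class
median_fare <- tapply(toy$Fare, toy$Pclass, median)
print(median_fare)
```

If the real data shows the same ordering, with the median fare falling from 1st to 3rd class, then Fare and Pclass largely carry the same information about survival.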

Embarked vs Survived

# count passengers by port of embarkation and survival
port <- train %>%
  group_by(Embarked) %>%
  count(Survived)

# remove missing value rows
port <- na.omit(port)

ggplot(port, aes(x = Embarked, y = n, fill = as.factor(Survived))) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, 
            position = position_dodge(width = 0.9)) +
  ylab("Count (n)") +
  labs(fill = "Survived") +
  scale_fill_manual(values = c("#FF6666", "#00B492"),
                    labels = c("No", "Yes")) +
  scale_x_discrete(labels = c("C" = "Cherbourg", "Q" = "Queenstown",
                              "S" = "Southampton"))

Output:

**Figure 8:** The bar plot shows the number of passengers who did not survive and who survived from each port of embarkation

Figure 8 shows the number of passengers who did not survive and who survived from the ports of Cherbourg, Queenstown, and Southampton. The number of survivors from Southampton is greater than from the other ports. However, the number of passengers from Southampton who did not survive is also greater than from the other ports.
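
When one port has both the most survivors and the most non-survivors, the share of survivors within each port is easier to compare than the counts. The base R sketch below computes row-wise shares with `prop.table()` on an invented data frame (`toy`); the values are made up and are not the Titanic counts.

```r
# Toy data (invented values) standing in for the Embarked and Survived columns
toy <- data.frame(
  Embarked = c("C", "C", "C", "Q", "Q", "S", "S", "S", "S", "S", "S", "S", "S"),
  Survived = c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
)

# Cross-tabulate port by survival, then convert each row to shares
counts    <- table(toy$Embarked, toy$Survived)
row_share <- prop.table(counts, margin = 1)

print(counts)
print(round(row_share, 2))
```

In this toy example port S has the most survivors (3) but a smaller share of survivors than port C.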

That’s all for my EDA on the Titanic dataset :)

I hope everyone who, like me, is interested in data science and wants to start learning data analysis will enjoy my EDA.

So, let’s start your own EDA!

Arissara

Written by Pang Arissara