EDA in R on Titanic data by Michael Eryan
Introduction
The data is a sample and incomplete, so we cannot make any generalizations or definitive conclusions. My goal is just exploration, observation and some speculation.
Automatically produced summary of the data set
## [1] "My data set is titanic_data.csv"
## [1] "I imported it into a data.frame"
## [1] "It has 12 columns and 891 rows."
## [1] "Here are the columns in the dataset:"
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked"
## [1] "Here is the structure of the dataset:"
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 ...
## $ Survived : int 0 1 ...
## $ Pclass : int 3 1 ...
## $ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 ...
## $ Age : num 22 38 ...
## $ SibSp : int 1 1 ...
## $ Parch : int 0 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 ...
## $ Fare : num 7.25 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 ...
## NULL
## [1] "Here are a few rows from this data frame:"
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
## [1] "Q: Have all the categorical vars been converted to factor?"
## PassengerId Survived Pclass Name Sex Age
## 891 2 3 891 2 89
## SibSp Parch Ticket Fare Cabin Embarked
## 7 7 681 248 148 4
## [1] "Observation: the Survived variable should really be a factor too. Let's transform it."
## [1] "Let's re-order the Sex factor so that Male is first. It will be useful later."
## [1] "Also let's create a variable Minor - if Age < 18. It will have missing values if Age does."
## [1] "Attach the data frame to save keystrokes later."
## [1] "Summary statistics for the whole data set"
## PassengerId Survived Pclass
## Min. : 1.0 Min. :0.0000 Min. :1.000
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000
## Median :446.0 Median :0.0000 Median :3.000
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Name Sex Age
## Abbing, Mr. Anthony : 1 male :577 Min. : 0.42
## Abbott, Mr. Rossmore Edward : 1 female:314 1st Qu.:20.12
## Abbott, Mrs. Stanton (Rosa Hunt) : 1 Median :28.00
## Abelson, Mr. Samuel : 1 Mean :29.70
## Abelson, Mrs. Samuel (Hannah Wizosky): 1 3rd Qu.:38.00
## Adahl, Mr. Mauritz Nils Martin : 1 Max. :80.00
## (Other) :885 NA's :177
## SibSp Parch Ticket Fare
## Min. :0.000 Min. :0.0000 1601 : 7 Min. : 0.00
## 1st Qu.:0.000 1st Qu.:0.0000 347082 : 7 1st Qu.: 7.91
## Median :0.000 Median :0.0000 CA. 2343: 7 Median : 14.45
## Mean :0.523 Mean :0.3816 3101295 : 6 Mean : 32.20
## 3rd Qu.:1.000 3rd Qu.:0.0000 347088 : 6 3rd Qu.: 31.00
## Max. :8.000 Max. :6.0000 CA 2144 : 6 Max. :512.33
## (Other) :852
## Cabin Embarked Survivedf Minor
## :687 : 2 0:549 FALSE:601
## B96 B98 : 4 C:168 1:342 TRUE :113
## C23 C25 C27: 4 Q: 77 NA's :177
## G6 : 4 S:644
## C22 C26 : 3
## D : 3
## (Other) :186
## [1] "Survived will be my Y, Gender is the first X, Age is continuous but has missing values (will not impute)"
## [1] "Q: What do the levels of Embarked mean?"
## [1] "" "C" "Q" "S"
## [1] "A: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)"
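The preparation steps mentioned in the output above (counting distinct values per column, turning Survived into a factor, putting male first in Sex, deriving Minor) could look roughly like the sketch below. The report does not show its own code, so this is an assumption about how it was done. The data frame `toy` is a six-row stand-in copied from the rows printed above, and only the names `Survivedf` and `Minor` come from the summary output.

```r
# Six-row stand-in for titanic_data.csv, copied from the head() output above.
toy <- data.frame(
  Survived = c(0L, 1L, 1L, 1L, 0L, 0L),
  Pclass   = c(3L, 1L, 3L, 1L, 3L, 3L),
  Sex      = factor(c("male", "female", "female", "female", "male", "male")),
  Age      = c(22, 38, 26, 35, 35, NA)
)

# Number of distinct values per column: a quick check for categorical variables.
n_unique <- sapply(toy, function(x) length(unique(x)))

# Survived as a factor, kept alongside the original integer column.
toy$Survivedf <- factor(toy$Survived)

# Put "male" first so it becomes the reference level in tables and models.
toy$Sex <- relevel(toy$Sex, ref = "male")

# Minor is TRUE when Age < 18 and stays NA wherever Age is missing.
toy$Minor <- toy$Age < 18

print(n_unique)
print(levels(toy$Sex))
print(table(toy$Minor, useNA = "ifany"))
```

Keeping the integer `Survived` next to the factor `Survivedf` is convenient because `lm()` needs a numeric response while tables and plots read better with a factor.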
Univariate Plots and Analysis
Single variable tabulations
## Survived
## 0 1
## 549 342
## Sex
## male female
## 577 314
## Pclass
## 1 2 3
## 216 184 491
## Minor
## FALSE TRUE
## 601 113
Bar charts - like histograms for discrete data

Observations: there are more males than females, more passengers died than survived, and 3rd class (the lowest socio-economic class) is the largest.
Fancier bar charts - Counts and Proportions of Gender vs Survival

Histogram of Age
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.42 20.12 28.00 29.70 38.00 80.00 177

Histogram and Density plots combined
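A combined histogram and density plot can be sketched in base graphics as below. This is not the report's own plotting code (which is not shown and appears to use grid-based graphics). The `age` vector is made up for illustration, with `NA` standing in for missing ages.

```r
# Hypothetical ages for illustration only; NA marks a missing value.
age <- c(0.42, 4, 9, 16, 19, 22, 22, 24, 26, 28, 28, 30,
         35, 35, 38, 45, 54, 62, 80, NA, NA)

# Five-number summary plus mean and NA count, in the same layout as the output above.
print(summary(age))

# Histogram on the density scale so a kernel density curve can share the y-axis.
# hist() silently drops missing values.
h <- hist(age, breaks = seq(0, 80, by = 5), freq = FALSE,
          main = "Histogram of Age", xlab = "Age (years)")

# density() does not drop NAs by default, so na.rm = TRUE is required.
d <- density(age, na.rm = TRUE)
lines(d, lwd = 2)
```

Setting `freq = FALSE` is the key choice here: with counts on the y-axis the density curve would be drawn on a different scale and look flat.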

Bivariate Plots and Analysis
Let’s facet Age by Sex and Survival

The two panels are almost mirror images, but is the difference statistically significant? Let’s test it with ANOVA.
First, let’s look at the mean of Age by gender*survival - groupby
## Group.1 Group.2 x
## 1 male 0 31.61806
## 2 female 0 25.04688
## 3 male 1 27.27602
## 4 female 1 28.84772
For survivors, the means are fairly close, but not for those who died. Is the difference significant?
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 156 156.1 0.697 0.404
## Residuals 288 64444 223.8
## 52 observations deleted due to missingness
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 2346 2346.4 11.99 0.000591 ***
## Residuals 422 82613 195.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 125 observations deleted due to missingness
Not significant among the survivors (p = 0.404).
Among the dead, the difference in age by sex is significant (p = 0.000591): on average, the men who died were older than the women who died.
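The group means and the two one-way ANOVAs above can be reproduced in outline as follows. The data here are simulated, so the numbers will not match the report. The `Group.1 Group.2 x` header in the output suggests the original used `aggregate()` with a `by =` list, and the formula interface below is a substitute that gives the same cell means.

```r
set.seed(1)  # simulated data, for illustration only

n <- 200
sim <- data.frame(
  Sex      = factor(sample(c("male", "female"), n, replace = TRUE),
                    levels = c("male", "female")),
  Survived = rbinom(n, 1, 0.4),
  Age      = pmin(pmax(round(rnorm(n, mean = 30, sd = 14)), 1), 80)
)
sim$Age[sample(n, 30)] <- NA  # mimic missing ages

# Mean age in each Sex x Survived cell (rows with missing Age are dropped).
cell_means <- aggregate(Age ~ Sex + Survived, data = sim, FUN = mean)
print(cell_means)

# One-way ANOVA of Age on Sex, fitted separately within each survival group.
fit_survived <- aov(Age ~ Sex, data = subset(sim, Survived == 1))
fit_died     <- aov(Age ~ Sex, data = subset(sim, Survived == 0))
print(summary(fit_survived))
print(summary(fit_died))
```

With only two groups, each of these ANOVAs is equivalent to a two-sample t-test with equal variances, so `t.test(Age ~ Sex, var.equal = TRUE)` would give the same p-value.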
Boxplot: age vs survived and sex

Honestly, these boxplots are not as informative as the previous histograms.
Let’s look closer at survival * sex
## [1] "Crosstabulation of Sex and Survival"
## Survived
## Sex 0 1 Sum
## male 468 109 577
## female 81 233 314
## Sum 549 342 891
## [1] "There definitely were more male than female passengers, but were females more likely to survive?"
## [1] "Calculate actual proportions"
## Survived
## Sex 0 1
## male 0.52525253 0.12233446
## female 0.09090909 0.26150393
## [1] "Suggests that females were more likely to survive: died-to-survived is about 1:3 for females and 4:1 for males"
## [1] "Let's test it - contingency table analysis using a chi-squared test"
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: t2
## X-squared = 260.72, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.4926894 0.6135708
## sample estimates:
## prop 1 prop 2
## 0.8110919 0.2579618
The results of this test indicate that the difference in proportions is very unlikely to be due to chance.
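The crosstabulation and the two-sample proportion test above can be rebuilt directly from the four counts in the table, without the original CSV. The sketch below does that. The object names `tab` and `res` are mine, and the original code may have built the table differently.

```r
# Counts taken from the Sex x Survived crosstabulation above.
tab <- as.table(matrix(c(468, 81, 109, 233), nrow = 2,
                       dimnames = list(Sex = c("male", "female"),
                                       Survived = c("0", "1"))))

print(addmargins(tab))   # counts with row and column sums
print(prop.table(tab))   # each cell as a share of all 891 passengers

# For a two-column table, prop.test() treats the first column as "successes",
# so the estimated proportions here are the shares who died in each sex.
res <- prop.test(tab)
print(res)
```

Note the direction of the estimates: because column `0` comes first, `prop 1` and `prop 2` are death rates (about 81% of males and 26% of females), not survival rates.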
Let’s use another package to do a proper test.
Crosstabulation, Fisher’s exact, Contingency table for two categoricals.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 891
##
##
## | Survived
## Sex | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## male | 468 | 109 | 577 |
## | 35.583 | 57.120 | |
## | 0.811 | 0.189 | 0.648 |
## | 0.852 | 0.319 | |
## | 0.525 | 0.122 | |
## -------------|-----------|-----------|-----------|
## female | 81 | 233 | 314 |
## | 65.386 | 104.962 | |
## | 0.258 | 0.742 | 0.352 |
## | 0.148 | 0.681 | |
## | 0.091 | 0.262 | |
## -------------|-----------|-----------|-----------|
## Column Total | 549 | 342 | 891 |
## | 0.616 | 0.384 | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 263.0506 d.f. = 1 p = 3.711748e-59
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 260.717 d.f. = 1 p = 1.197357e-58
##
##
## [1] "Pull the frequency table again"
## $prop.tbl
## y
## x 0 1
## male 0.52525253 0.12233446
## female 0.09090909 0.26150393
## [1] "Pull the results of the Chi-squared test"
## $chisq
##
## Pearson's Chi-squared test
##
## data: t
## X-squared = 263.05, df = 1, p-value < 2.2e-16
## [1] "Yes, p-value is very small, males were less likely to survive"
## [1] "Does this pattern hold when we look only at minors?"
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 113
##
##
## | Survived
## Sex | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## male | 35 | 23 | 58 |
## | 2.587 | 2.205 | |
## | 0.603 | 0.397 | 0.513 |
## | 0.673 | 0.377 | |
## | 0.310 | 0.204 | |
## -------------|-----------|-----------|-----------|
## female | 17 | 38 | 55 |
## | 2.728 | 2.326 | |
## | 0.309 | 0.691 | 0.487 |
## | 0.327 | 0.623 | |
## | 0.150 | 0.336 | |
## -------------|-----------|-----------|-----------|
## Column Total | 52 | 61 | 113 |
## | 0.460 | 0.540 | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 9.846588 d.f. = 1 p = 0.00170147
##
## Pearson's Chi-squared test with Yates' continuity correction
## ------------------------------------------------------------
## Chi^2 = 8.697291 d.f. = 1 p = 0.003186833
##
##
## $chisq
##
## Pearson's Chi-squared test
##
## data: t
## X-squared = 9.8466, df = 1, p-value = 0.001701
Yes, males were less likely to survive, and the pattern holds even for minors. Note that there are almost the same number of male and female minors: 58 and 55. But while 69.1% of the females survived, only 39.7% of the males did.
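Both chi-squared results above (all passengers and minors only) can be checked with base R's `chisq.test()` on the counts from the two tables. This is a sketch that swaps in base R for whichever crosstab package produced the formatted output. The object names are mine.

```r
# Counts copied from the two crosstabulations above.
all_tab   <- matrix(c(468, 81, 109, 233), nrow = 2,
                    dimnames = list(Sex = c("male", "female"),
                                    Survived = c("0", "1")))
minor_tab <- matrix(c(35, 17, 23, 38), nrow = 2,
                    dimnames = list(Sex = c("male", "female"),
                                    Survived = c("0", "1")))

# correct = FALSE gives the plain Pearson statistic; the default
# (correct = TRUE) applies Yates' continuity correction to 2 x 2 tables.
pearson_all   <- chisq.test(all_tab, correct = FALSE)
pearson_minor <- chisq.test(minor_tab, correct = FALSE)
yates_minor   <- chisq.test(minor_tab)

# Row proportions: share of each sex who died / survived among minors.
minor_rates <- prop.table(minor_tab, margin = 1)

print(pearson_all)
print(pearson_minor)
print(yates_minor)
print(round(minor_rates, 3))
```

The continuity correction matters more for the smaller minors table (113 observations) than for the full table, which is why the two minors statistics differ more in relative terms.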
Let’s look at the distribution of Age separately for males and females

This chart really makes it obvious: even among minors (look at 15-year-olds), males were less likely to survive. It also suggests that even among minors, the older ones were less likely to survive. This calls for a regression to estimate the marginal effects on the probability of survival.
What will the boxplot look like for minors?

Looks similar to the overall chart above. It shows that, minor or not, males were still less likely to survive.
Multivariate Plots And Analysis
Distribution of survival by age faceted by gender

Pretty stark difference by sex
Let’s build a Linear probability model first
Disclaimer: I will not create training and validation sets and score the model as I usually would.
## [1] "Let's see what will happen when we throw in Sex as well"
## [1] "Note that mtable is from memisc pkg"
##
## Calls:
## m1: lm(formula = tf, data = t)
## m2: lm(formula = Survived ~ Age + Pclass + Sex, data = t)
##
## ========================================
## m1 m2
## ----------------------------------------
## (Intercept) 1.240*** 0.847***
## (0.072) (0.067)
## Age -0.008*** -0.005***
## (0.001) (0.001)
## Pclass -0.264*** -0.203***
## (0.021) (0.019)
## Sex: female 0.479***
## (0.031)
## ----------------------------------------
## R-squared 0.2 0.4
## adj. R-squared 0.2 0.4
## sigma 0.4 0.4
## F 78.3 151.4
## p 0.0 0.0
## Log-likelihood -434.4 -328.9
## Deviance 141.1 105.0
## AIC 876.8 667.7
## BIC 895.1 690.6
## N 714 714
## ========================================
## [1] "Age and Pclass have negative sign, which makes sense"
## [1] "Being female has a large, highly significant positive coefficient, i.e. being male sharply lowers predicted survival"
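To see what the linear probability model implies, the reported m2 coefficients can be plugged into a plain weighted sum. The sketch below uses the rounded values from the table, so its predictions are approximate. The helper `lpm_predict` is hypothetical and not part of the report.

```r
# Coefficients of model m2 as reported (rounded) in the table above.
b <- c(intercept = 0.847, age = -0.005, pclass = -0.203, female = 0.479)

# Linear-probability prediction: a weighted sum with no link function.
lpm_predict <- function(age, pclass, female) {
  unname(b["intercept"] + b["age"] * age +
           b["pclass"] * pclass + b["female"] * female)
}

p_male_30_3rd   <- lpm_predict(age = 30, pclass = 3, female = 0)
p_female_30_3rd <- lpm_predict(age = 30, pclass = 3, female = 1)
p_male_60_3rd   <- lpm_predict(age = 60, pclass = 3, female = 0)

print(c(male_30 = p_male_30_3rd,
        female_30 = p_female_30_3rd,
        male_60 = p_male_60_3rd))
```

With these rounded coefficients the prediction for a 60-year-old man in 3rd class is slightly negative, which is not a valid probability. That is a known limitation of linear probability models and one reason to fit a logistic regression as well.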
Let’s build a logistic regression and get the odds as well
##
## Calls:
## l1: glm(formula = tf, family = binomial(link = "logit"), data = t)
## l2: glm(formula = Survived ~ Age + Pclass + Sex, family = binomial(link = "logit"),
## data = t)
##
## ==============================================
## l1 l2
## ----------------------------------------------
## (Intercept) 3.585*** 2.534***
## (0.407) (0.456)
## Age -0.042*** -0.037***
## (0.007) (0.008)
## Pclass -1.244*** -1.289***
## (0.119) (0.139)
## Sex: female 2.522***
## (0.207)
## ----------------------------------------------
## Aldrich-Nelson R-sq. 0.2 0.3
## McFadden R-sq. 0.1 0.3
## Cox-Snell R-sq. 0.2 0.4
## Nagelkerke R-sq. 0.2 0.5
## phi 1.0 1.0
## Likelihood-ratio 137.1 317.2
## p 0.0 0.0
## Log-likelihood -413.7 -323.6
## Deviance 827.4 647.3
## AIC 833.4 655.3
## BIC 847.1 673.6
## N 714 714
## ==============================================
## [1] "Same signs as the LPM, good"
## [1] "Now let's exponentiate the coefficients to get the odds ratios"
## OR 2.5 % 97.5 %
## (Intercept) 12.6022495 5.227870 31.3646211
## Age 0.9637445 0.949163 0.9780179
## Pclass 0.2756716 0.208325 0.3599055
## Sexfemale 12.4551085 8.366012 18.8756140
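The odds ratios above are the exponentiated logit coefficients. The sketch below repeats that step using the rounded l2 coefficients from the table, so the results differ from the printed ones in the third decimal or so. It then converts the linear predictor into predicted probabilities for one example passenger profile, chosen by me for illustration. With a fitted model object (here a hypothetical `fit`), a common idiom for the table above is `exp(cbind(OR = coef(fit), confint(fit)))`.

```r
# Logit coefficients of model l2 as reported (rounded) in the table above.
b <- c("(Intercept)" = 2.534, Age = -0.037, Pclass = -1.289, Sexfemale = 2.522)

# Exponentiating a logit coefficient gives the odds ratio for a one-unit increase.
odds_ratios <- exp(b)
print(round(odds_ratios, 3))

# Predicted probability for a 30-year-old in 3rd class, by sex.
# plogis() is the inverse logit: plogis(x) = 1 / (1 + exp(-x)).
eta_male   <- b[["(Intercept)"]] + b[["Age"]] * 30 + b[["Pclass"]] * 3
eta_female <- eta_male + b[["Sexfemale"]]
p_male   <- plogis(eta_male)
p_female <- plogis(eta_female)
print(round(c(male = p_male, female = p_female), 3))
```

Unlike the linear probability model, these predictions always stay between 0 and 1, and the ratio of the two groups' odds equals `exp(2.522)` at every age and class.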
Final Plots and Summary
Plot One: Bar chart of Survived and Gender

## [1] "Here are the actual percentages for the charts above:"
## Survived
## 0 1
## 62 38
## Sex
## male female
## 65 35
Plot One Discussion
Interestingly, the marginal distributions of Survived and Sex are almost identical:
65% of passengers were male and 62% of passengers did not survive.
Is there a link between being male and not surviving?
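The percentages quoted for Plot One follow directly from the single-variable counts. A minimal sketch, with variable names of my choosing:

```r
# Counts from the single-variable tabulations earlier in the report.
survived <- c("0" = 549, "1" = 342)
sex      <- c(male = 577, female = 314)

# Share of the total, rounded to whole percentages.
pct_survived <- round(100 * survived / sum(survived))
pct_sex      <- round(100 * sex / sum(sex))

print(pct_survived)
print(pct_sex)
```

Similar marginal percentages do not by themselves show that the two variables are related. That takes the joint table of Sex by Survived, which is what the crosstabulation and chi-squared test examine.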
Plot Two: Distribution of Age by Survived separately by Sex

Plot Two Discussion
This plot looks like a mirror image: most men did not survive, while most women did.
Plot Three: Linear probability model by Sex

Plot Three Discussion
The blue line can be interpreted as the predicted probability of surviving by age. For men this probability falls rapidly at 15 years old and stays pretty low. For women of all ages the probability of surviving is much higher.
Issues, Reflections, Conclusions (Speculations)
Issues: What we have is an incomplete data set - obviously there were more than 891 passengers on the Titanic. We can make no assumptions about whether we got a random sample or a biased one. Therefore we cannot draw any general conclusions, but we can speculate about the results.
There is missing data for Age. I did no imputation because I cannot assume that the values are MCAR (missing completely at random). Those observations were simply dropped by R’s procedures whenever the Age variable was involved.
My exploratory data analysis suggested a link between survival and gender, so I pursued this and conducted statistical tests.
A chi-squared test of survival by gender (Sex) returned a very low p-value (p < 2.2e-16), meaning a pattern this strong is very unlikely to arise by chance alone.
Both the linear probability model and the logistic regression estimated lower survival for males (equivalently, a positive coefficient for Sex: female).
The odds ratio of survival for females compared to males was 12.5, holding Age and Pclass constant.
This means that the odds of survival for females were about 12 times the odds for males. Note that this is a ratio of odds, not of probabilities.
This pattern holds even for minors (<18 years old) - females were still more likely to survive.
This seems to suggest that in the good old saying “women and children first” boys of 15 years or older do not count as “children.”
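The difference between an odds ratio and a ratio of survival probabilities can be made concrete with the raw counts from the Sex by Survived crosstabulation. The sketch below computes both. These are unadjusted figures (no Age or Pclass), so the odds ratio is close to but not the same as the 12.5 from the logistic model.

```r
# Counts from the Sex x Survived crosstabulation.
surv_f <- 233; died_f <- 81    # females
surv_m <- 109; died_m <- 468   # males

# Probability (risk) of survival in each group.
p_f <- surv_f / (surv_f + died_f)
p_m <- surv_m / (surv_m + died_m)

# Ratio of probabilities vs ratio of odds: these are different quantities.
risk_ratio <- p_f / p_m
odds_ratio <- (surv_f / died_f) / (surv_m / died_m)

print(round(c(p_female = p_f, p_male = p_m,
              risk_ratio = risk_ratio, odds_ratio = odds_ratio), 3))
```

In this sample, females were roughly 4 times as likely to survive as males in terms of probability, while their odds of survival were roughly 12 times higher.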
Next steps? In the future I could review other tragic events and disasters and analyze factors that influence the probability of survival.
Appendix: Data Dictionary
VARIABLE DESCRIPTIONS:
survival Survival
(0 = No; 1 = Yes)
pclass Passenger Class
(1 = 1st; 2 = 2nd; 3 = 3rd)
name Name
sex Sex
age Age
sibsp Number of Siblings/Spouses Aboard
parch Number of Parents/Children Aboard
ticket Ticket Number
fare Passenger Fare
cabin Cabin
embarked Port of Embarkation
(C = Cherbourg; Q = Queenstown; S = Southampton)
The End.