EDA in R on Titanic data by Michael Eryan

Introduction

The data is a sample and incomplete, so we cannot make any generalizations or definitive conclusions. My goal is just exploration, observation and some speculation.

Automatically produced summary of the data set

## [1] "My data set is titanic_data.csv"
## [1] "I imported it into a data.frame"
## [1] "It has 12 columns and 891 rows."
## [1] "Here are the columns in the dataset:"
##  [1] "PassengerId" "Survived"    "Pclass"      "Name"        "Sex"        
##  [6] "Age"         "SibSp"       "Parch"       "Ticket"      "Fare"       
## [11] "Cabin"       "Embarked"
## [1] "Here is the structure of the dataset:"
## 'data.frame':    891 obs. of  12 variables:
##  $ PassengerId: int  1 2 ...
##  $ Survived   : int  0 1 ...
##  $ Pclass     : int  3 1 ...
##  $ Name       : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 ...
##  $ Age        : num  22 38 ...
##  $ SibSp      : int  1 1 ...
##  $ Parch      : int  0 0 ...
##  $ Ticket     : Factor w/ 681 levels "110152","110413",..: 524 597 ...
##  $ Fare       : num  7.25 ...
##  $ Cabin      : Factor w/ 148 levels "","A10","A14",..: 1 83 ...
##  $ Embarked   : Factor w/ 4 levels "","C","Q","S": 4 2 ...
## NULL
## [1] "Here are a few rows from this data frame:"
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q
## [1] "Q: Have all the categorical vars been converted to factor?"
## PassengerId    Survived      Pclass        Name         Sex         Age 
##         891           2           3         891           2          89 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           7           7         681         248         148           4
## [1] "Observation: Survived variable should really be a factor too. Let's transform it."
## [1] "Let's re-order the Sex factor so that Male is first. It will be useful later."
## [1] "Also let's create a variable Minor - TRUE if Age < 18. It will have missing values wherever Age does."
## [1] "Attach the data frame to save keystrokes later."
## [1] "Summary statistics for the whole data set"
##   PassengerId       Survived          Pclass     
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :446.0   Median :0.0000   Median :3.000  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309  
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000  
##                                                  
##                                     Name         Sex           Age       
##  Abbing, Mr. Anthony                  :  1   male  :577   Min.   : 0.42  
##  Abbott, Mr. Rossmore Edward          :  1   female:314   1st Qu.:20.12  
##  Abbott, Mrs. Stanton (Rosa Hunt)     :  1                Median :28.00  
##  Abelson, Mr. Samuel                  :  1                Mean   :29.70  
##  Abelson, Mrs. Samuel (Hannah Wizosky):  1                3rd Qu.:38.00  
##  Adahl, Mr. Mauritz Nils Martin       :  1                Max.   :80.00  
##  (Other)                              :885                NA's   :177    
##      SibSp           Parch             Ticket         Fare       
##  Min.   :0.000   Min.   :0.0000   1601    :  7   Min.   :  0.00  
##  1st Qu.:0.000   1st Qu.:0.0000   347082  :  7   1st Qu.:  7.91  
##  Median :0.000   Median :0.0000   CA. 2343:  7   Median : 14.45  
##  Mean   :0.523   Mean   :0.3816   3101295 :  6   Mean   : 32.20  
##  3rd Qu.:1.000   3rd Qu.:0.0000   347088  :  6   3rd Qu.: 31.00  
##  Max.   :8.000   Max.   :6.0000   CA 2144 :  6   Max.   :512.33  
##                                   (Other) :852                   
##          Cabin     Embarked Survivedf   Minor    
##             :687    :  2    0:549     FALSE:601  
##  B96 B98    :  4   C:168    1:342     TRUE :113  
##  C23 C25 C27:  4   Q: 77              NA's :177  
##  G6         :  4   S:644                         
##  C22 C26    :  3                                 
##  D          :  3                                 
##  (Other)    :186
## [1] "Survived will be my Y, Gender is the first X, Age is continuous but has missing values (will not impute)"
## [1] "Q: What do the levels of Embarked mean?"
## [1] ""  "C" "Q" "S"
## [1] "A: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)"
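
The preparation steps mentioned in the output above (turning Survived into a factor, putting male first in Sex, deriving Minor) might look roughly like the sketch below. The data frame name `titanic` is an assumption, and only the six rows printed above are used so that the snippet runs without titanic_data.csv; `Survivedf` and `Minor` are the column names that appear in the summary.

```r
# First six rows as printed above; the full analysis uses all 891 rows.
# The name `titanic` is hypothetical, chosen for this sketch only.
titanic <- data.frame(
  Survived = c(0L, 1L, 1L, 1L, 0L, 0L),
  Pclass   = c(3L, 1L, 3L, 1L, 3L, 3L),
  Sex      = factor(c("male", "female", "female", "female", "male", "male")),
  Age      = c(22, 38, 26, 35, 35, NA)
)

# Survived is stored as an integer; keep a factor copy for tabulations and plots.
titanic$Survivedf <- factor(titanic$Survived)

# Factor levels are alphabetical by default ("female", "male"); make "male" first.
titanic$Sex <- relevel(titanic$Sex, ref = "male")

# Minor is TRUE/FALSE where Age is known and NA where Age is missing.
titanic$Minor <- titanic$Age < 18

str(titanic)
```

Because the comparison `Age < 18` propagates NA, no separate handling of missing ages is needed to keep Minor missing for those rows.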

Univariate Plots and Analysis

Single variable tabulations

## Survived
##   0   1 
## 549 342
## Sex
##   male female 
##    577    314
## Pclass
##   1   2   3 
## 216 184 491
## Minor
## FALSE  TRUE 
##   601   113

Bar charts - like histograms for discrete data

Observations: there were more males than females, more passengers died than survived, and class 3 (the lowest socio-economic class) was by far the largest.

Fancier bar charts - Counts and Proportions of Gender vs Survival

Histogram of Age

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.42   20.12   28.00   29.70   38.00   80.00     177

Histogram and Density plots combined

Bivariate Plots and Analysis

Let’s facet Age by Sex and Survival

The two panels are almost mirror images, but is the difference statistically significant? Let’s test it with ANOVA.

First, let’s look at the mean of Age by gender*survival - groupby

##   Group.1 Group.2        x
## 1    male       0 31.61806
## 2  female       0 25.04688
## 3    male       1 27.27602
## 4  female       1 28.84772

Among the survivors the mean ages are pretty close (27.3 for males vs. 28.8 for females), but not among the dead (31.6 vs. 25.0). But is the difference significant?

##              Df Sum Sq Mean Sq F value Pr(>F)
## Sex           1    156   156.1   0.697  0.404
## Residuals   288  64444   223.8               
## 52 observations deleted due to missingness
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## Sex           1   2346  2346.4   11.99 0.000591 ***
## Residuals   422  82613   195.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 125 observations deleted due to missingness

The difference in mean age by sex is not significant among the survivors (first table, p = 0.404).

It is definitely significant among the dead (second table, p < 0.001).

The men who died were, on average, older than the women who died.
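
As a sanity check, the F values and p-values can be recomputed from the sums of squares and degrees of freedom printed in the two ANOVA tables. This is only a sketch of the arithmetic behind the tables, not the original code; the helper name `anova_p` is made up, and small differences from the printed values come from rounding in the output.

```r
# Recompute F and p for a one-way ANOVA from its table entries.
# `anova_p` is a hypothetical helper written for this illustration.
anova_p <- function(ss_sex, ss_resid, df_sex, df_resid) {
  f_value <- (ss_sex / df_sex) / (ss_resid / df_resid)  # ratio of mean squares
  p_value <- pf(f_value, df_sex, df_resid, lower.tail = FALSE)
  c(F = f_value, p = p_value)
}

# Values copied from the tables above (first table: survivors, second: dead).
survivors <- anova_p(ss_sex = 156.1,  ss_resid = 64444, df_sex = 1, df_resid = 288)
dead      <- anova_p(ss_sex = 2346.4, ss_resid = 82613, df_sex = 1, df_resid = 422)

print(signif(rbind(survivors, dead), 3))
```

The residual degrees of freedom also confirm which table is which: 288 + 2 + 52 missing = 342 survivors, and 422 + 2 + 125 missing = 549 dead.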

Boxplot: age vs survived and sex

Honestly, these boxplots are not as informative as the previous histograms.

Let’s look closer at survival * sex

## [1] "Crosstabulation of Sex and Survival"
##         Survived
## Sex        0   1 Sum
##   male   468 109 577
##   female  81 233 314
##   Sum    549 342 891
## [1] "There definitely were more male than female passengers, but were females more likely to survive?"
## [1] "Calculate actual proportions"
##         Survived
## Sex               0          1
##   male   0.52525253 0.12233446
##   female 0.09090909 0.26150393
## [1] "Suggests that females were more likely to survive: died-to-survived is about 1:3 for females and 4:1 for males"
## [1] "Let's test it - contingency  table analysis using a chi-squared test"
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  t2
## X-squared = 260.72, df = 1, p-value < 2.2e-16
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.4926894 0.6135708
## sample estimates:
##    prop 1    prop 2 
## 0.8110919 0.2579618

The results from this test say that a difference in proportions this large is extremely unlikely to be due to chance alone: 81.1% of males died versus 25.8% of females.
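
The test above can be reproduced from the four cell counts of the crosstabulation alone. The sketch below rebuilds the table by hand (the object name `sex_by_survival` is an assumption; the original output calls its table `t2`) and runs both the two-sample proportion test and the uncorrected Pearson chi-squared test from base R.

```r
# Cell counts copied from the Sex x Survived crosstabulation above.
# matrix() fills column by column: first the Survived = 0 column, then Survived = 1.
sex_by_survival <- matrix(
  c(468, 81,     # Survived = 0: male, female
    109, 233),   # Survived = 1: male, female
  nrow = 2,
  dimnames = list(Sex = c("male", "female"), Survived = c("0", "1"))
)

# prop.test() treats the first column as "successes", so the estimated
# proportions are the shares of each sex that did NOT survive.
res <- prop.test(sex_by_survival)
print(res$estimate)
print(res$statistic)

# The same table without Yates' continuity correction (Pearson's chi-squared).
pearson <- chisq.test(sex_by_survival, correct = FALSE)
print(pearson$statistic)
```

For a 2 x 2 table the proportion test and the chi-squared test with continuity correction are the same test, which is why the X-squared value matches the Yates-corrected figure reported further down.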

Let’s use another package to do a proper test.

Crosstabulation, Fisher’s exact, Contingency table for two categoricals.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  891 
## 
##  
##              | Survived 
##          Sex |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##         male |       468 |       109 |       577 | 
##              |    35.583 |    57.120 |           | 
##              |     0.811 |     0.189 |     0.648 | 
##              |     0.852 |     0.319 |           | 
##              |     0.525 |     0.122 |           | 
## -------------|-----------|-----------|-----------|
##       female |        81 |       233 |       314 | 
##              |    65.386 |   104.962 |           | 
##              |     0.258 |     0.742 |     0.352 | 
##              |     0.148 |     0.681 |           | 
##              |     0.091 |     0.262 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       549 |       342 |       891 | 
##              |     0.616 |     0.384 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  263.0506     d.f. =  1     p =  3.711748e-59 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  260.717     d.f. =  1     p =  1.197357e-58 
## 
## 
## [1] "Pull the frequency table again"
## $prop.tbl
##         y
## x                 0          1
##   male   0.52525253 0.12233446
##   female 0.09090909 0.26150393
## [1] "Pull the results of the Chi-squared test"
## $chisq
## 
##  Pearson's Chi-squared test
## 
## data:  t
## X-squared = 263.05, df = 1, p-value < 2.2e-16
## [1] "Yes, p-value is very small, males were less likely to survive"
## [1] "Does this pattern hold when we look only at minors?"
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  113 
## 
##  
##              | Survived 
##          Sex |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##         male |        35 |        23 |        58 | 
##              |     2.587 |     2.205 |           | 
##              |     0.603 |     0.397 |     0.513 | 
##              |     0.673 |     0.377 |           | 
##              |     0.310 |     0.204 |           | 
## -------------|-----------|-----------|-----------|
##       female |        17 |        38 |        55 | 
##              |     2.728 |     2.326 |           | 
##              |     0.309 |     0.691 |     0.487 | 
##              |     0.327 |     0.623 |           | 
##              |     0.150 |     0.336 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |        52 |        61 |       113 | 
##              |     0.460 |     0.540 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  9.846588     d.f. =  1     p =  0.00170147 
## 
## Pearson's Chi-squared test with Yates' continuity correction 
## ------------------------------------------------------------
## Chi^2 =  8.697291     d.f. =  1     p =  0.003186833 
## 
## 
## $chisq
## 
##  Pearson's Chi-squared test
## 
## data:  t
## X-squared = 9.8466, df = 1, p-value = 0.001701

Yes, males were less likely to survive, and the pattern holds even for minors. Note that the numbers of male and female minors are almost the same: 58 and 55. But while 69.1% of the female minors survived, only 39.7% of the male minors did.
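
The survival shares and the chi-squared statistic for minors can be checked directly from the cell counts in the table above. This is a minimal base-R sketch; the object names are assumptions and the counts are typed in by hand rather than derived from the data file.

```r
# Cell counts for passengers under 18, copied from the crosstabulation above.
minors <- matrix(
  c(35, 17,    # Survived = 0: male, female
    23, 38),   # Survived = 1: male, female
  nrow = 2,
  dimnames = list(Sex = c("male", "female"), Survived = c("0", "1"))
)

# Row proportions: the share of each sex that died / survived.
row_share <- prop.table(minors, margin = 1)
print(round(row_share, 3))

# Pearson's chi-squared test without continuity correction.
minors_test <- chisq.test(minors, correct = FALSE)
print(minors_test)
```

With only 113 minors the evidence is weaker than for the full data set, but the p-value is still well below 0.01.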

Let’s look at the distribution of Age separately for males and females

This chart really makes it obvious - even among minors (look at the 15-year-olds), males were less likely to survive. It also suggests that even among minors, the older ones were less likely to survive. This calls for a regression to estimate the marginal effects on the probability of survival.

What will the boxplot look like for minors?

Looks similar to the overall chart above. It shows that, minor or not, males were still less likely to survive.

Multivariate Plots And Analysis

Distribution of survival by age faceted by gender

Pretty stark difference by sex

Let’s build a Linear probability model first

Disclaimer: I will not split the data into training and validation sets and score the model, as one usually would.

## [1] "Let's see what will happen when we throw in Sex as well"
## [1] "Note that mtable is from memisc pkg"
## 
## Calls:
## m1: lm(formula = tf, data = t)
## m2: lm(formula = Survived ~ Age + Pclass + Sex, data = t)
## 
## ========================================
##                      m1         m2      
## ----------------------------------------
##   (Intercept)      1.240***   0.847***  
##                   (0.072)    (0.067)    
##   Age             -0.008***  -0.005***  
##                   (0.001)    (0.001)    
##   Pclass          -0.264***  -0.203***  
##                   (0.021)    (0.019)    
##   Sex: female                 0.479***  
##                              (0.031)    
## ----------------------------------------
##   R-squared           0.2        0.4    
##   adj. R-squared      0.2        0.4    
##   sigma               0.4        0.4    
##   F                  78.3      151.4    
##   p                   0.0        0.0    
##   Log-likelihood   -434.4     -328.9    
##   Deviance          141.1      105.0    
##   AIC               876.8      667.7    
##   BIC               895.1      690.6    
##   N                 714        714      
## ========================================
## [1] "Age and Pclass have negative sign, which makes sense"
## [1] "Sex: female has a large, significant positive coefficient, i.e. being male has a really significant negative effect"
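
To see what the m2 coefficients imply, the sketch below plugs a few passenger profiles into the fitted equation by hand. It uses the rounded coefficients from the table above, so the results are approximate; the names `lpm_m2` and `predict_lpm` are made up for this illustration.

```r
# Rounded m2 coefficients copied from the table above (not refitted here).
lpm_m2 <- c(intercept = 0.847, age = -0.005, pclass = -0.203, female = 0.479)

# Hypothetical helper: predicted survival probability under the linear model.
# `female` is 1 for females and 0 for males.
predict_lpm <- function(age, pclass, female) {
  unname(lpm_m2["intercept"] +
           lpm_m2["age"] * age +
           lpm_m2["pclass"] * pclass +
           lpm_m2["female"] * female)
}

p_male_3rd   <- predict_lpm(age = 30, pclass = 3, female = 0)
p_female_1st <- predict_lpm(age = 30, pclass = 1, female = 1)
p_girl_1st   <- predict_lpm(age = 2,  pclass = 1, female = 1)

print(c(male_3rd = p_male_3rd, female_1st = p_female_1st, girl_1st = p_girl_1st))
```

The last profile gives a predicted "probability" above 1, a known weakness of the linear probability model and one reason to also fit a logistic regression.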

Let’s build a logistic regression and get the odds as well

## 
## Calls:
## l1: glm(formula = tf, family = binomial(link = "logit"), data = t)
## l2: glm(formula = Survived ~ Age + Pclass + Sex, family = binomial(link = "logit"), 
##     data = t)
## 
## ==============================================
##                            l1         l2      
## ----------------------------------------------
##   (Intercept)            3.585***   2.534***  
##                         (0.407)    (0.456)    
##   Age                   -0.042***  -0.037***  
##                         (0.007)    (0.008)    
##   Pclass                -1.244***  -1.289***  
##                         (0.119)    (0.139)    
##   Sex: female                       2.522***  
##                                    (0.207)    
## ----------------------------------------------
##   Aldrich-Nelson R-sq.      0.2        0.3    
##   McFadden R-sq.            0.1        0.3    
##   Cox-Snell R-sq.           0.2        0.4    
##   Nagelkerke R-sq.          0.2        0.5    
##   phi                       1.0        1.0    
##   Likelihood-ratio        137.1      317.2    
##   p                         0.0        0.0    
##   Log-likelihood         -413.7     -323.6    
##   Deviance                827.4      647.3    
##   AIC                     833.4      655.3    
##   BIC                     847.1      673.6    
##   N                       714        714      
## ==============================================
## [1] "Same signs as the LPM, good"
## [1] "Now let's exponentiate the coefficients to get the odds ratios"
##                     OR    2.5 %     97.5 %
## (Intercept) 12.6022495 5.227870 31.3646211
## Age          0.9637445 0.949163  0.9780179
## Pclass       0.2756716 0.208325  0.3599055
## Sexfemale   12.4551085 8.366012 18.8756140

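Interpretation: holding the other variables constant, each extra year of age multiplies the odds of surviving by 0.96 (about a 4% decrease). Moving down one class (from 1 to 2, or from 2 to 3) multiplies the odds by 0.28. Finally, going from female to male multiplies the odds by 0.08; equivalently, the odds of survival for females were about 12.5 times the odds for males. Note that this is a ratio of odds, not of probabilities: in the raw crosstabulation above, 74.2% of females survived versus 18.9% of males, so females were roughly 4 times as likely to survive. Also, as we have seen above, the pattern holds even among minors (<18 years old) - females were more likely to survive.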
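
The relationship between the logit coefficients, the odds ratios, and plain survival probabilities can be checked with a few lines of arithmetic. The sketch below uses the rounded l2 coefficients and the raw Sex x Survived counts from earlier sections; the variable names are assumptions, and the raw odds ratio is unadjusted, so it differs slightly from the model-based one.

```r
# Rounded l2 coefficients copied from the logistic regression table above.
logit_l2 <- c(age = -0.037, pclass = -1.289, female = 2.522)

# Exponentiating a logit coefficient gives the odds ratio for a one-unit change.
odds_ratios <- exp(logit_l2)
print(round(odds_ratios, 3))

# Raw survival probabilities by sex, from the crosstabulation (233/314, 109/577).
p_female <- 233 / 314
p_male   <- 109 / 577

# Ratio of probabilities ("times as likely") versus ratio of odds.
risk_ratio     <- p_female / p_male
raw_odds_ratio <- (p_female / (1 - p_female)) / (p_male / (1 - p_male))
print(round(c(risk_ratio = risk_ratio, raw_odds_ratio = raw_odds_ratio), 2))
```

The two ratios differ a lot because survival was not a rare outcome; an odds ratio only approximates "times as likely" when the probabilities involved are small.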


Final Plots and Summary

Plot One: Bar chart of Survived and Gender

## [1] "Here are the actual percentages for the charts above:"
## Survived
##  0  1 
## 62 38
## Sex
##   male female 
##     65     35

Plot One Discussion

Interestingly, the distributions of Survived and Sex are almost identical.

That is, 65% of passengers were male and 62% of passengers did not survive.
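
The percentages quoted above follow directly from the single-variable tabulations earlier in the report. A minimal sketch, with the counts typed in by hand and variable names chosen for illustration:

```r
# Counts copied from the single-variable tabulations.
survived_counts <- c("0" = 549, "1" = 342)
sex_counts      <- c(male = 577, female = 314)

# Convert counts to whole-number percentages of the 891 passengers.
survived_pct <- round(100 * prop.table(survived_counts))
sex_pct      <- round(100 * prop.table(sex_counts))

print(survived_pct)
print(sex_pct)
```

Similar marginal percentages do not by themselves show that sex and survival are related; that is what the crosstabulation and the chi-squared test address.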

Plot Two: Distribution of Age by Survived separately by Sex

Plot Two Discussion

This plot looks like a mirror image: most men did not survive, while most women did.

Plot Three: Linear probability model by Sex

Plot Three Discussion

The blue line can be interpreted as the predicted probability of surviving by age. For men this probability falls rapidly at 15 years old and stays pretty low. For women of all ages the probability of surviving is much higher.


Issues, Reflections, Conclusions (Speculations)

Issues: What we have is an incomplete data set - obviously there were more than 891 passengers on the Titanic. We can make no assumptions about whether we got a random sample or a biased one. Therefore we cannot really draw any general conclusions, but we can speculate about the results.

There is missing data for Age. I did no imputation because I cannot assume that the values are MCAR (missing completely at random). Those observations were simply dropped by R’s procedures whenever the Age variable was involved.

The chi-squared test of the frequency of survival by gender (Sex) returned a very low p-value, meaning this pattern is extremely unlikely to have occurred by chance alone.

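Both the linear probability model and the logistic regression estimated a negative effect of being male on survival.

The odds ratio of survival for females compared to males was 12.5, controlling for age and class.

This means that the odds of survival for females were about 12.5 times the odds for males. In terms of raw probabilities, 74.2% of females survived versus 18.9% of males.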

This pattern holds even for minors (<18 years old) - females were still more likely to survive.

This seems to suggest that in the good old saying “women and children first” boys of 15 years or older do not count as “children.”

Next steps? In the future I could review other tragic events and disasters and analyze factors that influence the probability of survival.

Appendix: Data Dictionary

VARIABLE DESCRIPTIONS:

survival   Survival (0 = No; 1 = Yes)
pclass     Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name       Name
sex        Sex
age        Age
sibsp      Number of Siblings/Spouses Aboard
parch      Number of Parents/Children Aboard
ticket     Ticket Number
fare       Passenger Fare
cabin      Cabin
embarked   Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

The End.