EDA in SAS on Titanic data by Michael Eryan

Setup and import the data.

In [38]:
options nosource nonotes; * avoid most of the log output ;

libname t "/folders/myfolders/";

%let input=titanic_data.csv;
%let dataset=titanic;

/*
FILENAME REFFILE '/folders/myfolders/titanic_data.csv';
PROC IMPORT DATAFILE=REFFILE
	DBMS=CSV REPLACE
	OUT=t.titanic;
	GETNAMES=YES;
RUN;
*/
Out[38]:


General data set examination and preparation

Add the variable "Minor" and load the data set into macro variables to avoid typos. The data is a sample and incomplete, so we cannot make any generalizations or definitive conclusions. My goal is just exploration, observation and some speculation.

In [39]:
proc sql;
create table titanic as
select
*
,case when age is not null then 
	(case when age < 18 then 1 else 0 end)
	else . end as minor
from t.titanic
;quit;
Out[39]:


Automatically produced summary of the data set

The input data set is titanic_data.csv. Proc Contents provides all the info about the data set.

In [40]:
ods select Attributes Variables ;
proc contents data=&dataset ;
run;
ods select default;
Out[40]:
SAS Output

Estimated logit plot of passenger age

The CONTENTS Procedure

Data Set Name WORK.TITANIC Observations 891
Member Type DATA Variables 13
Engine V9 Indexes 0
Created 09/30/2017 15:37:52 Observation Length 152
Last Modified 09/30/2017 15:37:52 Deleted Observations 0
Protection   Compressed NO
Data Set Type   Sorted NO
Label      
Data Representation SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64    
Encoding utf-8 Unicode (UTF-8)    
Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
6 Age Num 8 BEST12. BEST32.
11 Cabin Char 4 $4. $4.
12 Embarked Char 1 $1. $1.
10 Fare Num 8 BEST12. BEST32.
4 Name Char 57 $57. $57.
8 Parch Num 8 BEST12. BEST32.
1 PassengerId Num 8 BEST12. BEST32.
3 Pclass Num 8 BEST12. BEST32.
5 Sex Char 6 $6. $6.
7 SibSp Num 8 BEST12. BEST32.
2 Survived Num 8 BEST12. BEST32.
9 Ticket Char 16 $16. $16.
13 minor Num 8    

Print a few rows of the data set

In [41]:
proc print data=&dataset (obs=10);
run;
Out[41]:
SAS Output

Estimated logit plot of passenger age

Obs PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked minor
1 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.25   S 0
2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0 PC 17599 71.2833 C85 C 0
3 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.925   S 0
4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1 C123 S 0
5 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.05   S 0
6 6 0 3 Moran, Mr. James male . 0 0 330877 8.4583   Q .
7 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S 0
8 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.075   S 1
9 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333   S 0
10 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708   C 1

Print a summary of all the numerical variables in the data

In [42]:
proc means data=&dataset n nmiss min mean median max range std fw=8 maxdec=2;
run;
Out[42]:
SAS Output

Estimated logit plot of passenger age

The MEANS Procedure

Variable N N Miss Minimum Mean Median Maximum Range Std Dev
PassengerId
Survived
Pclass
Age
SibSp
Parch
Fare
minor
891
891
891
714
891
891
891
714
0
0
0
177
0
0
0
177
1.00
0.00
1.00
0.42
0.00
0.00
0.00
0.00
446.00
0.38
2.31
29.70
0.52
0.38
32.20
0.16
446.00
0.00
3.00
28.00
0.00
0.00
14.45
0.00
891.00
1.00
3.00
80.00
8.00
6.00
512.33
1.00
890.00
1.00
2.00
79.58
8.00
6.00
512.33
1.00
257.35
0.49
0.84
14.53
1.10
0.81
49.69
0.37

Univariate Plots and Analysis

Single variable tabulations

In [43]:
proc format;
	value survfmt 1="Survived" 0="Died";
	value $sexfmt  "male"="1_Male" "female"="2_Female";
run;

proc sql;
select survived,count(*) as count from &dataset group by 1;
select sex,count(*) as count from &dataset group by 1;
select pclass,count(*) as count from &dataset group by 1;
select minor,count(*) as count from &dataset group by 1;
quit;
Out[43]:
SAS Output

Estimated logit plot of passenger age

Survived count
0 549
1 342

Estimated logit plot of passenger age

Sex count
female 314
male 577

Estimated logit plot of passenger age

Pclass count
1 216
2 184
3 491

Estimated logit plot of passenger age

minor count
. 177
0 601
1 113

Bar Charts

In [44]:
proc sgplot data=&dataset ;                                                                                                                 
   vbar survived / seglabel;                                                                                                         
run; 
proc sgplot data=&dataset;                                                                                                                 
   vbar sex / seglabel;                                                                                                        
run; 
proc sgplot data=&dataset;                                                                                                                 
   vbar pclass / seglabel;                                                                                                         
run; 
proc sgplot data=&dataset;                                                                                                                 
   vbar minor / seglabel;                                                                                                       
run; 
Out[44]:
SAS Output
The SGPlot Procedure

The SGPlot Procedure

The SGPlot Procedure

The SGPlot Procedure

Observations: more males than females, more died than survived, 3 is the poorest socio-econ class.

Fancier bar charts - Counts and Proportions of Gender vs Survival.

In [45]:
proc sgplot data=&dataset;                                                                                                                 
   vbar sex / group=survived seglabel seglabelattrs=(size=12);
   format survived survfmt.;                                                                                                         
run; 
Out[45]:
SAS Output
The SGPlot Procedure

Histogram with Density Plot of Age

In [46]:
proc sgplot data=&dataset;                                                                                                                 
   histogram age ;
   density age / type=kernel; 
   keylegend / location=inside position=topright;                                                                                                      
run; 
Out[46]:
SAS Output
The SGPlot Procedure

Bivariate Plots and Analysis

Let’s facet Age by Sex and Survival

In [47]:
proc sgpanel data = &dataset;
  panelby sex survived / columns = 2 rows =2;
  histogram age / scale=count;
  density age ;
  colaxis values= (0 to 80 by 5);
  format survived survfmt.;
run;
Out[47]:
SAS Output
The SGPanel Procedure

Almost mirror image, but is it statistically significant? Let’s test by Anova.

First, let's look at the mean of Age by gender*survival.

In [48]:
proc sql;
select sex,survived,mean(age) as mean_age
from &dataset
group by 1,2
order by 2,1 desc
;quit;
Out[48]:
SAS Output

Estimated logit plot of passenger age

Sex Survived mean_age
male 0 31.61806
female 0 25.04688
male 1 27.27602
female 1 28.84772

For survivors, the means are pretty close, but not so for dead. But is it significant?

In [49]:
data t1;
set &dataset;
where survived=1;
run;

ods select OverallANOVA;
proc anova data=t1 plots=none;
	class sex;
	model Age = sex;
run;
Out[49]:
SAS Output

Estimated logit plot of passenger age

The ANOVA Procedure

Dependent Variable: Age

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 156.05845 156.05845 0.70 0.4043
Error 288 64444.39050 223.76524    
Corrected Total 289 64600.44895      

Age difference is not significant among the survivors.

In [50]:
data t0;
set &dataset;
where survived=0;
run;
ods select OverallANOVA;
proc anova data=t0 plots=none;
	class sex;
	model Age = sex;
run;
ods select default;
Out[50]:
SAS Output

Estimated logit plot of passenger age

The ANOVA Procedure

Dependent Variable: Age

Source DF Sum of Squares Mean Square F Value Pr > F
Model 1 2346.40740 2346.40740 11.99 0.0006
Error 422 82612.59201 195.76444    
Corrected Total 423 84958.99941      

But definitely significant among the dead. Yes, the difference in age by sex is definitely significant. The men who died were older than the women who died.

Let’s look closer at survival sex pclass

In [51]:
proc freq data=&dataset;
	tables survived sex pclass
			sex * survived pclass*survived / 
			plots (only)=freqplot(scale=percent);
	format survived survfmt.;
run;
Out[51]:
SAS Output

Estimated logit plot of passenger age

The FREQ Procedure

Survived Frequency Percent Cumulative
Frequency
Cumulative
Percent
Died 549 61.62 549 61.62
Survived 342 38.38 891 100.00
Bar Chart of Percents for Survived
Sex Frequency Percent Cumulative
Frequency
Cumulative
Percent
female 314 35.24 314 35.24
male 577 64.76 891 100.00
Bar Chart of Percents for Sex
Pclass Frequency Percent Cumulative
Frequency
Cumulative
Percent
1 216 24.24 216 24.24
2 184 20.65 400 44.89
3 491 55.11 891 100.00
Bar Chart of Percents for Pclass
Frequency
Percent
Row Pct
Col Pct
Table of Sex by Survived
Sex Survived
Died Survived Total
female
81
9.09
25.80
14.75
233
26.15
74.20
68.13
314
35.24
 
 
male
468
52.53
81.11
85.25
109
12.23
18.89
31.87
577
64.76
 
 
Total
549
61.62
342
38.38
891
100.00
Bar Chart of Percents for Sex by Survived
Frequency
Percent
Row Pct
Col Pct
Table of Pclass by Survived
Pclass Survived
Died Survived Total
1
80
8.98
37.04
14.57
136
15.26
62.96
39.77
216
24.24
 
 
2
97
10.89
52.72
17.67
87
9.76
47.28
25.44
184
20.65
 
 
3
372
41.75
75.76
67.76
119
13.36
24.24
34.80
491
55.11
 
 
Total
549
61.62
342
38.38
891
100.00
Bar Chart of Percents for Pclass by Survived

We see that there might be a link between being male and not surviving. Let's do a test for association (contingency table test).

Hypothesis: There is an association between sex and survival also try class and age.

In [52]:
proc freq data=&dataset;
	tables 
			(sex pclass) * survived  / 
			chisq expected cellchi2 nocol nopercent relrisk;
	format survived survfmt.;
run;
Out[52]:
SAS Output

Estimated logit plot of passenger age

The FREQ Procedure

Frequency
Expected
Cell Chi-Square
Row Pct
Table of Sex by Survived
Sex Survived
Died Survived Total
female
81
193.47
65.386
25.80
233
120.53
104.96
74.20
314
 
 
 
male
468
355.53
35.583
81.11
109
221.47
57.12
18.89
577
 
 
 
Total
549
342
891

Statistics for Table of Sex by Survived

Statistic DF Value Prob
Chi-Square 1 263.0506 <.0001
Likelihood Ratio Chi-Square 1 268.8512 <.0001
Continuity Adj. Chi-Square 1 260.7170 <.0001
Mantel-Haenszel Chi-Square 1 262.7553 <.0001
Phi Coefficient   -0.5434  
Contingency Coefficient   0.4774  
Cramer's V   -0.5434  
Fisher's Exact Test
Cell (1,1) Frequency (F) 81
Left-sided Pr <= F <.0001
Right-sided Pr >= F 1.0000
   
Table Probability (P) <.0001
Two-sided Pr <= P <.0001
Odds Ratio and Relative Risks
Statistic Value 95% Confidence Limits
Odds Ratio 0.0810 0.0583 0.1124
Relative Risk (Column 1) 0.3180 0.2626 0.3852
Relative Risk (Column 2) 3.9280 3.2770 4.7084

Sample Size = 891

Frequency
Expected
Cell Chi-Square
Row Pct
Table of Pclass by Survived
Pclass Survived
Died Survived Total
1
80
133.09
21.178
37.04
136
82.909
33.997
62.96
216
 
 
 
2
97
113.37
2.3647
52.72
87
70.626
3.796
47.28
184
 
 
 
3
372
302.54
15.95
75.76
119
188.46
25.603
24.24
491
 
 
 
Total
549
342
891

Statistics for Table of Pclass by Survived

Statistic DF Value Prob
Chi-Square 2 102.8890 <.0001
Likelihood Ratio Chi-Square 2 103.5471 <.0001
Mantel-Haenszel Chi-Square 1 101.9668 <.0001
Phi Coefficient   0.3398  
Contingency Coefficient   0.3217  
Cramer's V   0.3398  

Sample Size = 891

Odds ratio (top row/bottom row) is significant at 0.08 - women are more likely than men to survive. 0.08 says that a female has about 8 of the odds of dying compared to a male. Alternatively males have 92 lower odds of surviving than females.

Does this pattern hold among the minors?

In [53]:
proc freq data=&dataset;
	tables 
			sex  * survived  / 
			chisq expected cellchi2 nocol nopercent relrisk;
	where minor=1;
	format survived survfmt.;
run;
Out[53]:
SAS Output

Estimated logit plot of passenger age

The FREQ Procedure

Frequency
Expected
Cell Chi-Square
Row Pct
Table of Sex by Survived
Sex Survived
Died Survived Total
female
17
25.31
2.7283
30.91
38
29.69
2.3257
69.09
55
 
 
 
male
35
26.69
2.5871
60.34
23
31.31
2.2054
39.66
58
 
 
 
Total
52
61
113

Statistics for Table of Sex by Survived

Statistic DF Value Prob
Chi-Square 1 9.8466 0.0017
Likelihood Ratio Chi-Square 1 10.0085 0.0016
Continuity Adj. Chi-Square 1 8.6973 0.0032
Mantel-Haenszel Chi-Square 1 9.7595 0.0018
Phi Coefficient   -0.2952  
Contingency Coefficient   0.2831  
Cramer's V   -0.2952  
Fisher's Exact Test
Cell (1,1) Frequency (F) 17
Left-sided Pr <= F 0.0015
Right-sided Pr >= F 0.9996
   
Table Probability (P) 0.0011
Two-sided Pr <= P 0.0024
Odds Ratio and Relative Risks
Statistic Value 95% Confidence Limits
Odds Ratio 0.2940 0.1352 0.6394
Relative Risk (Column 1) 0.5122 0.3276 0.8008
Relative Risk (Column 2) 1.7423 1.2115 2.5057

Sample Size = 113

Yes, it does.

In [54]:
proc sgpanel data = &dataset;
  panelby sex survived / columns = 2 rows =2;
  histogram age / scale=count;
  density age ;
  colaxis values= (0 to 18 by 3);
  where minor=1;
  format survived survfmt.;
run;
proc sgplot data=&dataset;                                                                                                                 
   vbar sex / group=survived seglabel seglabelattrs=(size=12);
   where minor=1;
   format survived survfmt.;                                                                                                         
run; 
Out[54]:
SAS Output
The SGPanel Procedure

The SGPlot Procedure

Looks similiar to the overall chart above. Shows that Minor or not, males were still less likely to survive.

Detecting Ordinal Associations - survival and socio-economic class.

In [55]:
proc freq data=&dataset;
	tables 
			pclass * survived  / 
			chisq measures cl;
	format survived survfmt.;
run;
Out[55]:
SAS Output

Estimated logit plot of passenger age

The FREQ Procedure

Frequency
Percent
Row Pct
Col Pct
Table of Pclass by Survived
Pclass Survived
Died Survived Total
1
80
8.98
37.04
14.57
136
15.26
62.96
39.77
216
24.24
 
 
2
97
10.89
52.72
17.67
87
9.76
47.28
25.44
184
20.65
 
 
3
372
41.75
75.76
67.76
119
13.36
24.24
34.80
491
55.11
 
 
Total
549
61.62
342
38.38
891
100.00

Statistics for Table of Pclass by Survived

Statistic DF Value Prob
Chi-Square 2 102.8890 <.0001
Likelihood Ratio Chi-Square 2 103.5471 <.0001
Mantel-Haenszel Chi-Square 1 101.9668 <.0001
Phi Coefficient   0.3398  
Contingency Coefficient   0.3217  
Cramer's V   0.3398  
Statistic Value ASE 95%
Confidence Limits
Gamma -0.5486 0.0431 -0.6330 -0.4643
Kendall's Tau-b -0.3235 0.0303 -0.3830 -0.2641
Stuart's Tau-c -0.3433 0.0327 -0.4074 -0.2791
Somers' D C|R -0.2885 0.0273 -0.3420 -0.2349
Somers' D R|C -0.3628 0.0341 -0.4296 -0.2960
Pearson Correlation -0.3385 0.0319 -0.4011 -0.2759
Spearman Correlation -0.3397 0.0319 -0.4021 -0.2772
Lambda Asymmetric C|R 0.1637 0.0393 0.0867 0.2408
Lambda Asymmetric R|C 0.0425 0.0391 0.0000 0.1191
Lambda Symmetric 0.0984 0.0353 0.0292 0.1676
Uncertainty Coefficient C|R 0.0873 0.0167 0.0546 0.1199
Uncertainty Coefficient R|C 0.0582 0.0112 0.0364 0.0801
Uncertainty Coefficient Symmetric 0.0699 0.0134 0.0437 0.0960

Sample Size = 891

Yes, there is a relationsip between class and survival chisq is significant and spearman is -0.3397.

Multivariate Plots And Analysis

Distribution of survival by age faceted by gender

In [56]:
proc sgplot data=&dataset;
 	pbspline x=age y=survived / group=sex; 
 	where age <70;          
run;
Out[56]:
SAS Output
The SGPlot Procedure

Does not look very similiar to R's plot, but proves the same point.

Let’s build a Linear probability model first.

In [57]:
proc glm data=&dataset alpha=0.5;
	class sex pclass;
	model survived  = age sex pclass /solution;
	title 'Linear probability model: survived  = age sex pclass';
run;
Out[57]:
SAS Output

Linear probability model: survived = age sex pclass

The GLM Procedure

Class Level Information
Class Levels Values
Sex 2 female male
Pclass 3 1 2 3
Number of Observations Read 891
Number of Observations Used 714

Linear probability model: survived = age sex pclass

The GLM Procedure

Dependent Variable: Survived

Source DF Sum of Squares Mean Square F Value Pr > F
Model 4 67.1939857 16.7984964 113.41 <.0001
Error 709 105.0188995 0.1481226    
Corrected Total 713 172.2128852      
R-Square Coeff Var Root MSE Survived Mean
0.390180 94.75690 0.384867 0.406162
Source DF Type I SS Mean Square F Value Pr > F
Age 1 1.02692222 1.02692222 6.93 0.0086
Sex 1 49.09856250 49.09856250 331.47 <.0001
Pclass 2 17.06850095 8.53425047 57.62 <.0001
Source DF Type III SS Mean Square F Value Pr > F
Age 1 3.76045774 3.76045774 25.39 <.0001
Sex 1 36.08573279 36.08573279 243.62 <.0001
Pclass 2 17.06850095 8.53425047 57.62 <.0001
Parameter Estimate   Standard
Error
t Value Pr > |t|
Intercept 0.2389468762 B 0.03626494 6.59 <.0001
Age -.0054600623   0.00108365 -5.04 <.0001
Sex female 0.4794556708 B 0.03071788 15.61 <.0001
Sex male 0.0000000000 B . . .
Pclass 1 0.4066179863 B 0.03828823 10.62 <.0001
Pclass 2 0.1988705476 B 0.03640832 5.46 <.0001
Pclass 3 0.0000000000 B . . .

Note:The X'X matrix has been found to be singular, and a generalized inverse was used to solve the normal equations. Terms whose estimates are followed by the letter 'B' are not uniquely estimable.

Analysis of Covariance for Survived by Age categorized by Sex * Pclass

Signs make sense: Age has a negative sign, being female has a positive sign, being of higher social class has a positive sign on probability of survival.

Let’s build a logistic regression and get the odds as well.

In [58]:
proc logistic data=&dataset alpha=0.5
	plots (only) = (effect oddsratio);
	model survived (event='1') = age / clodds=pl;
	title 'Logistic Model (1): survived=age';
run;
Out[58]:
SAS Output

Logistic Model (1): survived=age

The LOGISTIC Procedure

Model Information
Data Set WORK.TITANIC
Response Variable Survived
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 891
Number of Observations Used 714
Response Profile
Ordered
Value
Survived Total
Frequency
1 0 424
2 1 290

Probability modeled is Survived='1'.

Note:177 observations were deleted due to missing values for the response or explanatory variables.

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 966.516 964.228
SC 971.087 973.370
-2 Log L 964.516 960.228
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 4.2876 1 0.0384
Score 4.2577 1 0.0391
Wald 4.2310 1 0.0397
Analysis of Maximum Likelihood Estimates
Parameter DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept 1 -0.0567 0.1736 0.1068 0.7438
Age 1 -0.0110 0.00533 4.2310 0.0397
Association of Predicted Probabilities and Observed Responses
Percent Concordant 52.1 Somers' D 0.062
Percent Discordant 45.9 Gamma 0.063
Percent Tied 2.0 Tau-a 0.030
Pairs 122960 c 0.531
Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 50% Confidence Limits
Age 1.0000 0.989 0.986 0.993
Plot of Odds Ratios with 50% Profile-Likelihood Confidence Limits
Plot of Predicted Probabilities for Survived=1 by Age, with 50% Confidence Limits.

Odds ratio is 0.992 - so by increasing age by 1 year we decrease odds of surviving by 0.8.

Logistic model with a categorical predictor.

In [59]:
proc logistic data=&dataset alpha=0.5
	plots (only) = (effect oddsratio);
	class sex(ref='male') pclass(ref='3') / param=ref;
	model survived (event='1') = age sex pclass / clodds=pl;
	units age=10;
	title 'Logistic Model (2): survived=age sex class';
run;
Out[59]:
SAS Output

Logistic Model (2): survived=age sex class

The LOGISTIC Procedure

Model Information
Data Set WORK.TITANIC
Response Variable Survived
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 891
Number of Observations Used 714
Response Profile
Ordered
Value
Survived Total
Frequency
1 0 424
2 1 290

Probability modeled is Survived='1'.

Note:177 observations were deleted due to missing values for the response or explanatory variables.

Class Level Information
Class Value Design Variables
Sex female 1  
  male 0  
Pclass 1 1 0
  2 0 1
  3 0 0
Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 966.516 657.283
SC 971.087 680.138
-2 Log L 964.516 647.283
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 317.2328 4 <.0001
Score 278.5884 4 <.0001
Wald 191.2608 4 <.0001
Type 3 Analysis of Effects
Effect DF Wald
Chi-Square
Pr > ChiSq
Age 1 23.3376 <.0001
Sex 1 147.9716 <.0001
Pclass 2 85.6039 <.0001
Analysis of Maximum Likelihood Estimates
Parameter   DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept   1 -1.3264 0.2479 28.6297 <.0001
Age   1 -0.0370 0.00766 23.3376 <.0001
Sex female 1 2.5228 0.2074 147.9716 <.0001
Pclass 1 1 2.5806 0.2814 84.0755 <.0001
Pclass 2 1 1.2708 0.2440 27.1156 <.0001
Association of Predicted Probabilities and Observed Responses
Percent Concordant 85.1 Somers' D 0.705
Percent Discordant 14.6 Gamma 0.706
Percent Tied 0.3 Tau-a 0.340
Pairs 122960 c 0.852
Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 50% Confidence Limits
Age 10.0000 0.691 0.656 0.727
Sex female vs male 1.0000 12.463 10.847 14.350
Pclass 1 vs 3 1.0000 13.205 10.939 15.992
Pclass 2 vs 3 1.0000 3.564 3.024 4.204
Plot of Odds Ratios with 50% Profile-Likelihood Confidence Limits
Plot of Predicted Probabilities for Survived=1 by Age, sliced by Sex*Pclass.

Interpretation of odds ratios: Increasing age by 10 years decreases the odds of surviving by 0.691, going from male to female increases the odds by 12.463, going from Pllass=3 to 1 increases the odds by 13.205 etc.

Logistic regression: backward elimination with interactions.

In [60]:
proc logistic data=&dataset alpha=0.5
	plots (only) = (effect oddsratio);
	class sex(ref='male') pclass(ref='3') / param=ref;
	model survived (event='1') = age|sex|pclass @2/ selection=backward slstay=0.01 clodds=pl;
	units age=10;
	title 'Logistic Model (3): Backward elimination, survived=age sex class';
run;
Out[60]:
SAS Output

Logistic Model (3): Backward elimination, survived=age sex class

The LOGISTIC Procedure

Model Information
Data Set WORK.TITANIC
Response Variable Survived
Number of Response Levels 2
Model binary logit
Optimization Technique Fisher's scoring
Number of Observations Read 891
Number of Observations Used 714
Response Profile
Ordered
Value
Survived Total
Frequency
1 0 424
2 1 290

Probability modeled is Survived='1'.

Note:177 observations were deleted due to missing values for the response or explanatory variables.

Backward Elimination Procedure

Class Level Information
Class Value Design Variables
Sex female 1  
  male 0  
Pclass 1 1 0
  2 0 1
  3 0 0

Step 0. The following effects were entered:

Intercept Age Sex Age*Sex Pclass Age*Pclass Sex*Pclass

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 966.516 624.335
SC 971.087 670.044
-2 Log L 964.516 604.335
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 360.1806 9 <.0001
Score 307.0826 9 <.0001
Wald 160.2179 9 <.0001

Step 1. Effect Age*Sex is removed:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 966.516 624.925
SC 971.087 666.063
-2 Log L 964.516 606.925
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 357.5914 8 <.0001
Score 305.0076 8 <.0001
Wald 155.8008 8 <.0001
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
2.6085 1 0.1063

Step 2. Effect Age*Pclass is removed:

Model Convergence Status
Convergence criterion (GCONV=1E-8) satisfied.
Model Fit Statistics
Criterion Intercept Only Intercept and Covariates
AIC 966.516 627.427
SC 971.087 659.423
-2 Log L 964.516 613.427
Testing Global Null Hypothesis: BETA=0
Test Chi-Square DF Pr > ChiSq
Likelihood Ratio 351.0893 6 <.0001
Score 304.0535 6 <.0001
Wald 165.8907 6 <.0001
Residual Chi-Square Test
Chi-Square DF Pr > ChiSq
8.5217 3 0.0364

Note:No (additional) effects met the 0.01 significance level for removal from the model.

Summary of Backward Elimination
Step Effect
Removed
DF Number
In
Wald
Chi-Square
Pr > ChiSq
1 Age*Sex 1 5 2.5867 0.1078
2 Age*Pclass 2 4 5.8291 0.0542
Joint Tests
Effect DF Wald
Chi-Square
Pr > ChiSq
Age 1 26.2539 <.0001
Sex 1 28.2477 <.0001
Pclass 2 43.8036 <.0001
Sex*Pclass 2 28.8486 <.0001

Note:Under full-rank parameterizations, Type 3 effect tests are replaced by joint tests. The joint test for an effect is a test that all the parameters associated with that effect are zero. Such joint tests might not be equivalent to Type 3 effect tests under GLM parameterization.

Analysis of Maximum Likelihood Estimates
Parameter     DF Estimate Standard
Error
Wald
Chi-Square
Pr > ChiSq
Intercept     1 -0.7059 0.2575 7.5153 0.0061
Age     1 -0.0419 0.00818 26.2539 <.0001
Sex female   1 1.4466 0.2722 28.2477 <.0001
Pclass 1   1 1.9765 0.3099 40.6808 <.0001
Pclass 2   1 0.1473 0.3385 0.1892 0.6636
Sex*Pclass female 1 1 2.1926 0.6853 10.2380 0.0014
Sex*Pclass female 2 1 2.8615 0.5910 23.4394 <.0001
Association of Predicted Probabilities and Observed Responses
Percent Concordant 86.0 Somers' D 0.723
Percent Discordant 13.7 Gamma 0.724
Percent Tied 0.3 Tau-a 0.349
Pairs 122960 c 0.861
Odds Ratio Estimates and Profile-Likelihood Confidence Intervals
Effect Unit Estimate 50% Confidence Limits
Age 10.0000 0.657 0.622 0.695
Plot of Odds Ratios with 50% Profile-Likelihood Confidence Limits
Plot of Predicted Probabilities for Survived=1 by Age, sliced by Sex*Pclass.

Plot empirical estimated logits - to verify if the assumption of linear holds.

In [61]:
proc means data=&dataset noprint nway;
	class pclass;
	var survived;
	output out=bins1 sum(survived)=nevent n(survived)=ncases;
run;
data bins2;
	set bins1;
	logit=log((nevent + 1)/(ncases - nevent+1));
run;
proc sgplot data=bins2;
	reg y=logit x=pclass / markerattrs=(symbol=asterisk color=blue size=15);
	pbspline y=logit x=pclass / nomarkers;
	xaxis integer;
	title "Estimated logit plot of passenger class";
run;
quit;
Out[61]:
SAS Output
The SGPlot Procedure

Looks sort of linear, so logistic regression is applicable. Same but for a continuous variable - bin into 50 groups of at least 20 obs to have enough obs for logit.

In [62]:
proc rank data=&dataset groups=50 out=ranks;
	var age;
	ranks Rank;
run;
proc means data=ranks noprint nway;
	class rank;
	var survived age;
	output out=bins3 sum(survived)=nevent n(survived)=ncases mean(age)=age;
run;
data bins4;
	set bins3;
	logit=log((nevent + 1)/(ncases - nevent+1));
run;
proc sgplot data=bins4;
	reg y=logit x=age/ markerattrs=(symbol=asterisk color=blue size=15);
	pbspline y=logit x=age/ nomarkers;
	xaxis integer;
	title "Estimated logit plot of passenger age";
run;
quit;
Out[62]:
SAS Output
The SGPlot Procedure

Nonlinearity detected - perhaps we need to transform Age before plugging into a logistic regression.

Final Plots and Summary.

Plot One: Bar chart of Survived and Gender.

In [63]:
proc sgplot data=&dataset;                                                                                                                 
   vbar survived / seglabel fillattrs= (color= lightgrey);
   xaxis discreteorder=formatted; 
   format survived survfmt.;                                                                                                        
run; 
proc sgplot data=&dataset ;                                                                                                                 
   vbar sex / seglabel fillattrs= (color= pink);
   xaxis discreteorder=formatted; 
   format sex $sexfmt.;                                                                                                        
run; 

proc sql noprint;
select count(*) into :total from &dataset;
quit;
proc sql;
select survived,count(*) as count, count(*) / &total  as percent format=percent10. from &dataset group by 1;
select sex,count(*) as count, count(*) / &total  as percent format=percent10. from &dataset group by 1 order by sex desc;
quit;
Out[63]:
SAS Output
The SGPlot Procedure

The SGPlot Procedure

Estimated logit plot of passenger age

Survived count percent
0 549 62%
1 342 38%

Estimated logit plot of passenger age

Sex count percent
male 577 65%
female 314 35%

Plot One Discussion. Interesting, the distributions of Survived and Sex are almost identical. That is 65pct of passengers were male and 62pct of passengers did not survive. Is there a link between being male and not surviving?

Plot Two: Distribution of Age by Survived separately by Sex.

In [64]:
proc sgpanel data = &dataset;
  panelby sex survived / columns = 2 rows =2;
  histogram age / scale=count;
  density age ;
  colaxis values= (0 to 80 by 5);
  format survived survfmt. sex $sexfmt.;
run;
Out[64]:
SAS Output
The SGPanel Procedure

Plot Two Discussion. This plot looks like a mirror image: most of men did not survive, while most women did.

Plot Three: Linear probability model by Sex.

In [65]:
proc sgplot data=&dataset;
 	pbspline x=age y=survived / group=sex;   
 	format survived survfmt. ;
 	where age <70;        
run;
Out[65]:
SAS Output
The SGPlot Procedure

Plot Three Discussion.

The blue line can be interpreted as the predicted probability of surviving by age. For men this probability falls rapidly at 15 years old and stays pretty low. For women of all ages the probability of surviving is much higher.

Issues, Reflections, Conclusions (Speculations)

Issues: What we have is an incomplete data set - obviously there were more than 891 passengers on Titanic. We can make no assumptions about whether we got a random sample or a biased one. Therefore we cannot really make any general conclusions but we can speculate about the results.

There is missing data for Age. I did no imputations because I cannot make any assumptions whether the values are MCAR (missing completely at random). Those observations were just dropped by the R’s procedures when Age variable was invovled. My exploratory data analysis suggested a link between survival and gender, so I pursued this and conducted statistical tests. Chi-squared test of the frequency of survival based on gender (Sex) returned a very low p-value meaning this pattern could not have occurred by chance.

Both the linear and logistic probability models returned negative estimates for being male on survival. The odds ratio of survival of being female comparing to being a male was 12.5. This means that females were 12 times more likely to survive that males.

This pattern holds even for minors (<18 years old) - females were still more likely to survive. This seems to suggest that in the good old saying “women and children first” boys of 15 years or older do not count as “children.” Next steps? In the future I could review other tragic events and disasters and analyze factors that influence the probability of survival.

Appendix: Data Dictionary

Data source: https://www.kaggle.com/c/titanic/data

VARIABLE DESCRIPTIONS:

survival Survival (0 = No; 1 = Yes) pclass Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd) name Name sex Sex age Age sibsp Number of Siblings/Spouses Aboard parch Number of Parents/Children Aboard ticket Ticket Number fare Passenger Fare cabin Cabin embarked Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

The End.