1 Introduction

Survival analysis refers to the analysis of the time to an event. The response is often referred to as a failure time, survival time, or event time.
There are many possible applications in various fields; here are some examples:

  • Time until tumor recurrence

  • Time until death after transplantation

  • Duration of use of a service by customers

  • Time to failure of an electronic component

The underlying questions may be:

  • What is the probability of tumor recurrence after the end of treatment?

  • How effective is heart transplantation?

  • What feature will improve the duration of use of a service by customers?

  • How can the time to failure of an electronic component be predicted?

The survival time response is continuous and often subject to censoring (incomplete information).

In the following, we will present the main survival estimators and their application to a simple survival model (cancer data set) :

Simple survival model (2 states)

1.1 Terminology and notations

1.1.1 Generalities

To begin with, the variables needed for survival analysis are presented below. For each individual \(i\), the underlying quantities are:

  • \(X_{i}\) is the survival time

  • \(C_{i}\) is the time to censoring

But in practice, we only have access to:

  • \(T_{i}\), the observed time

  • \(\delta_{i}\), the censoring indicator: \(\delta_{i}=1\) if \(T_{i}=X_{i}\) and \(\delta_{i}=0\) if \(T_{i}=C_{i}\)

1.1.2 Censoring and truncation

Censoring or truncation occurs when the information about the event of interest is incomplete.
Censored or truncated data arise because we do not have access to all the information: instead of observing independent and identically distributed (i.i.d.) realizations of the duration X, we observe realizations of X subject to various disturbances, which may or may not be independent of the phenomenon under study.

1.1.2.1 Censoring

Censoring is the most frequently encountered phenomenon when collecting survival data.
The time to event is not known exactly, but other information is available.

There are three different types of censoring:

  • Right censoring: the event of interest has not occurred at the time of the analysis.

  • Left censoring: the event of interest has occurred before the beginning of the study.

  • Interval censoring: the event of interest has occurred between two dates.

The most common type of censoring is right censoring.

1.1.2.2 Right censoring

The time of event is considered right censored if the event has not occurred by the last observation. Thus, the survival time is not known exactly, but we know that it is greater than the censoring time. There are three types of right censoring:

  • Type I censoring: the event of interest has not occurred at the time of the analysis C, thus the observation is censored at the end of the study time.

  • Type II censoring : The study is stopped after observing a fixed number of events : k events among n observations. Thus, the observation is censored at the time of the kth event.

  • Type III or Type I random censoring (the main type): the study is designed to end after C years, but censored subjects do not all have the same censoring time. We consider \(C_{1},C_{2},...,C_{n}\) i.i.d. random variables (a simulation sketch is given at the end of this subsection).
    The observed time is defined as:

\[ T_{i} = min(X_{i},C_{i}) \]

The information can be summarized as:

  • \(T_{i}\), the observed time

  • the censoring indicator: \(\delta_{i}=1\) if \(T_{i}=X_{i}\) and \(\delta_{i}=0\) if \(T_{i}=C_{i}\)

Examples for Type III:

  • During a clinical trial, the study may be designed to end after 5 years, but some patients may be lost to follow-up after 2 years, others after 3 years, etc. Thus, the observation is censored at the time of the last observation.

  • Change in treatment assignment during a clinical trial.

  • End of study before the occurrence of the event of interest.

Note: In the following examples, we will consider right censoring.
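To make the notation concrete, here is a minimal simulation sketch of Type I random right censoring. The distributions used (exponential event times, uniform censoring times) are illustrative assumptions, not taken from the text above:

# Minimal sketch (assumed distributions): simulate latent event and censoring times,
# then build the observed data (T_i, delta_i) used throughout this document
library(survival)
set.seed(1)
n <- 200
X <- rexp(n, rate = 0.1)     # latent survival times (assumption)
C <- runif(n, 0, 15)         # latent censoring times (assumption)
T_obs <- pmin(X, C)          # observed time: T_i = min(X_i, C_i)
delta <- as.numeric(X <= C)  # censoring indicator: 1 = event observed, 0 = censored
head(Surv(T_obs, delta))     # the Surv object used later by survfit(), coxph(), ...
mean(delta)                  # proportion of uncensored observations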

1.1.2.3 Left censoring

The time of event is considered as left censored if the event has occurred before the beginning of the study (\(t_{0}\)). Thus, the survival time is not known exactly, but we know that it is less than the time of censoring.

The information can be summarized as:

  • \(T_{i}\), the observed time

  • the censoring indicator: \(\delta_{i}=1\) if \(T_{i}=X_{i}\) and \(\delta_{i}=0\) if \(T_{i} < t_{0}\)

As with right censoring, left censoring is assumed to be independent of the event of interest.

Note: Few studies have focused on left censoring.

1.1.2.4 Interval censoring

The time of event is considered interval censored if the event has occurred within an interval. Instead of observing an exact event time, the only available information is that the event occurred between two dates.
Example: In a cohort, patients are followed up intermittently (e.g., every 6 months). The event of interest may occur between two visits.

Note: Most of the time, the event time is taken to be the time of the visit, which simplifies the analysis and brings us back to right censoring.

1.1.2.5 Truncation

Truncation differs from censoring because it concerns the sampling of the population. A variable \(X\) is truncated by \(A\) if, instead of observing \(X\) for all individuals, we only observe \(X\) when \(X \leq A\).
Truncation can be seen as a special case of selection bias.

1.1.3 Functions of interest

Considering right censoring, the information needed in the data is usually:

  • \(T_{i}\), the observed time: \(T_{i} = \min(X_{i},C_{i})\)

  • the censoring indicator: \(\delta_{i}=1\) if \(T_{i}=X_{i}\) and \(\delta_{i}=0\) if \(T_{i}=C_{i}\)

Some functions can be defined, which will most often be the objects of interest in survival analysis (a numerical illustration follows the list):

  • The survival function: \(S(t)=P(X>t)\).

  • The distribution function: \(F(t)=P(X \leq t)=1-S(t)\).

  • The density function (the derivative of the distribution function): \(f(t)=F'(t)=\lim_{h \to 0} \frac{P(t \leq X \leq t+h)}{h}\)

  • The instantaneous hazard function: \(\lambda(t)=\lim_{h \to 0} \frac{P(t \leq X \leq t+h | X \geq t)}{h} = \frac{f(t)}{S(t)}\)

  • The cumulative hazard function: \(\Lambda(t)=\int_{0}^{t} \lambda(u) du = -\ln(S(t))\).
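As an illustrative sketch, these functions and their relationships can be checked numerically for an assumed exponential distribution (the exponential choice is only for illustration; nothing in the text imposes it):

# Sketch: functions of interest for an assumed exponential model X ~ Exp(rate = 0.1)
rate <- 0.1
t <- seq(0, 30, by = 5)
S      <- 1 - pexp(t, rate)   # survival function S(t) = P(X > t)
F_dist <- pexp(t, rate)       # distribution function F(t) = 1 - S(t)
f      <- dexp(t, rate)       # density f(t) = F'(t)
lambda <- f / S               # instantaneous hazard (constant here, equal to rate)
Lambda <- -log(S)             # cumulative hazard, the integral of lambda
round(cbind(t, S, F_dist, f, lambda, Lambda), 3)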

1.2 Library

For the survival model, the R package survival (cancer data) was used.

# General package for survival analysis 
library(survival)
library(ggplot2)
library(survminer)
library(gridtext)
library(dplyr)
library(genSurv)
library(caret)
library(survAUC)
library(fitdistrplus)
library(grf)
library(DiagrammeR)
library(pec)
library(randomForestSRC)
library(checkmate)
library(SparseM)
library(Matrix)
library(doParallel)

1.3 Dataset

In this example, we will use lung cancer data set from library(survival). This data set contains the following variables:

  • time: Survival time in days

  • status: censoring status 0=censored, 1=dead

  • age: Age in years

  • sex: Male=1 Female=2

  • ph.ecog: ECOG performance score as rated by the physician. 0=asymptomatic, 1= symptomatic but completely ambulatory, 2= in bed <50% of the day, 3= in bed > 50% of the day but not bedbound, 4 = bedbound

  • ph.karno: Karnofsky performance score (bad=0-good=100) rated by physician

  • pat.karno: Karnofsky performance score as rated by patient

  • meal.cal: Calories consumed at meals

  • wt.loss: Weight loss in last six months (pounds)

Here are the head of the dataset and its dimensions:

time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
306 1 74 1 1 90 100 1175 NA
455 1 68 1 0 90 90 1225 15
1010 0 56 1 0 90 90 NA 15
210 1 57 1 1 90 60 1150 11
883 1 60 1 0 100 90 NA 0
1022 0 74 1 1 50 80 513 0
## [1] 228  10

Each row corresponds to a patient. It is essentially a time-to-event dataset. The event of interest is death and “time” is the survival time. The “status” variable is the censoring indicator. The other variables are covariates.

Illustration of time to event data

Example of the distribution of follow-up times according to event status:

The number of dead patients is higher than the number of censored patients in this dataset.

Censored data are not missing data; they are incomplete data.
That is why we cannot use the same methods as for missing data: some methods are specific to survival analysis. The following estimators evaluate the functions of interest while taking the censored information into account.

2 Survival analysis estimators

2.1 Non parametric estimators

These estimators are considered non-parametric because they make no assumption about the underlying distribution of the data.
The advantage is flexibility: model complexity grows with the number of observations.
The main disadvantage is that it is not easy to incorporate covariates: the usual approach is to fit a separate model on different subpopulations and compare them.

Note: In the following examples, we will consider right censoring, with censoring times i.i.d. (independent and identically distributed) and non-informative.

2.1.1 Nelson-Aalen estimator

The Nelson-Aalen estimator is a non-parametric estimator of the cumulative hazard function. It is defined as follows: \[\Lambda(t)=\int_{0}^{t} \lambda(u) du = -\ln(S(t))\]

For a simple survival model, the definition is simpler. Basically, it is the sum of the estimated hazard at each death time:

\[\Lambda(t)=\sum_{i:t_{i} \leq t} \frac{d_{i}}{n_{i}}\] with \(d_{i}\) the number of deaths at \(t_{i}\) and \(n_{i}\) the number of patients at risk just before \(t_{i}\).
At each death time, the event brings information to the estimator. The corresponding function is a step function.

# This code computes the Nelson-Aalen estimator for the "Cancer" dataset
fit <- survfit(Surv(time, status) ~ 1, data = cancer)
print(head(fit$cumhaz))
## [1] 0.004385965 0.017601824 0.022066110 0.031034720 0.035559606 0.040105061
ggsurvplot(fit, data = cancer, conf.int = TRUE, fun="cumhaz")

The plot of the cumulative hazard function is a step function: each step corresponds to a death time. The cumulative hazard function is not a probability; it is a cumulative quantity (which is why it can exceed 1). A higher value indicates a higher cumulative risk. The curve represents the accumulation of risk over time.
Each visible cross on the plot corresponds to a censored time.

In our case, the slope of the curve barely changes over time: the risk of death seems roughly constant until t = 900.

The confidence interval is calculated using counting processes and a Poisson approximation:

\(\hat{Var}(\hat{\Lambda}(t))=\sum_{i:t_{i} \leq t} \frac{d_{i}}{Y_{i}^{2}}\) with \(Y_{i}\) the number of individuals at risk at \(t_{i}\). Thus: \(CI_{95\%}(\hat{\Lambda}(t))=\hat{\Lambda}(t) \pm 1.96 \sqrt{\hat{Var}(\hat{\Lambda}(t))}\)
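As a sketch of this computation (assuming, as described in the dataset section, that the larger value of status codes an observed event), the Nelson-Aalen estimate and its Poisson-type confidence interval can be recomputed by hand and compared with the survfit() output above:

# Sketch: hand-computed Nelson-Aalen estimator with a Poisson-type 95% CI
na_by_hand <- function(time, status) {
  event <- as.numeric(status == max(status))    # assumption: larger value = event
  tj <- sort(unique(time[event == 1]))          # distinct death times
  d  <- sapply(tj, function(t) sum(time == t & event == 1))  # deaths at t
  n  <- sapply(tj, function(t) sum(time >= t))  # at risk just before t
  H  <- cumsum(d / n)                           # cumulative hazard
  V  <- cumsum(d / n^2)                         # variance estimate
  data.frame(time = tj, cumhaz = H,
             lower = H - 1.96 * sqrt(V), upper = H + 1.96 * sqrt(V))
}
head(na_by_hand(cancer$time, cancer$status))    # compare with fit$cumhaz above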

The cumulative hazard function can be easily calculated :
Calculation of Cumulative Hazard Function

For patients 73, 79 and 108, the instantaneous hazards are equal because their deaths occurred at the same time, and the Nelson-Aalen estimator takes \(n_{i}\) as the number of patients at risk just before \(t_{i}\): that is why the denominator equals 227 for all three, since just before time 11 there are 227 patients at risk.

The Nelson-Aalen estimator is nearly unbiased, uniformly consistent and asymptotically normal.

2.1.2 Kaplan-Meier estimator

The Kaplan-Meier estimator is the best-known estimator in survival analysis. It is a non-parametric estimator of the survival function.
The survival function is defined as \(S(t)=P(X>t)\) in the case of a simple survival model: it represents the probability of surviving beyond time t.

The Kaplan-Meier estimator is defined as follows:
\(S(t)=\prod_{i:t_{i} \leq t} (1-\frac{d_{i}}{n_{i}})\) with \(d_{i}\) the number of deaths at \(t_{i}\) and \(n_{i}\) the number of patients at risk just before \(t_{i}\).
It is approximately equal to \(S(t)=\exp(-\Lambda(t))\), which is easily calculated from the Nelson-Aalen estimator.
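As a quick sketch of this relationship, both quantities can be read off the same survfit object and compared:

# Sketch: compare the Kaplan-Meier estimate with exp(-Nelson-Aalen)
fit <- survfit(Surv(time, status) ~ 1, data = cancer)
head(round(cbind(time = fit$time, KM = fit$surv, exp_NA = exp(-fit$cumhaz)), 4))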

This code computes the Kaplan-Meier estimator for the “Cancer” dataset. There are two ways of doing so:

# With the coxph function, without covariates
kaplan <- coxph(Surv(time, status) ~ 1, data = cancer)

# With the survfit function
kaplan2 <- survfit(Surv(time, status) ~ 1, data = cancer)
ggsurvplot(
  kaplan2,                     # survival fit object
  surv.median.line = "hv",     # add horizontal and vertical median lines
  conf.int = TRUE,             # display the confidence interval
  risk.table = TRUE,           # display the risk table below the plot
  risk.table.title = "Risks",  # title of the risk table
  title = "Survival Curves",   # plot title
  xlab = "Time (days)",        # x-axis label
  ylab = "Survival Probability" # y-axis label
)

The confidence interval is calculated with the Greenwood formula:
\(\hat{Var}(\hat{S}(t))=\hat{S}(t)^{2} \sum_{i:t_{i} \leq t} \frac{d_{i}}{n_{i}(n_{i}-d_{i})}\)
Thus: \(CI_{95\%}(\hat{S}(t))=\hat{S}(t) \pm 1.96 \sqrt{\hat{Var}(\hat{S}(t))}\)
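A minimal sketch of the Greenwood computation, recomputed from the survfit output (note that survfit builds its default interval on the log scale, so its bounds may differ slightly):

# Sketch: Greenwood variance and a plain 95% CI recomputed from a survfit object
km <- survfit(Surv(time, status) ~ 1, data = cancer)
greenwood_se <- km$surv * sqrt(cumsum(km$n.event / (km$n.risk * (km$n.risk - km$n.event))))
head(round(cbind(time = km$time, surv = km$surv,
                 lower = pmax(0, km$surv - 1.96 * greenwood_se),
                 upper = pmin(1, km$surv + 1.96 * greenwood_se)), 4))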

The Kaplan-Meier estimator is nearly unbiased (when the number of individuals at risk is high), uniformly consistent and asymptotically normal.

Other non-parametric estimators exist, such as the Breslow estimator or the Fleming-Harrington estimator, but they are used less often than the Nelson-Aalen and Kaplan-Meier estimators.

As said before, non-parametric estimators cannot incorporate covariates in the model. An alternative is to stratify the population according to a covariate and compare the resulting survival curves using the log-rank test.

For example, let’s compare the survival curves of women and men :

# Comparison of two survival curves: women and men (sex = 1 men, sex = 2 women)
cancer_women <- cancer %>%
  dplyr::filter(sex == 2)

cancer_men <- cancer %>%
  dplyr::filter(sex == 1)

kaplan_women <- survfit(Surv(time, status) ~ 1, data = cancer_women)
kaplan_men <- survfit(Surv(time, status) ~ 1, data = cancer_men)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

The log-rank test compares the survival curves of two groups by comparing the observed and expected numbers of events. The test is defined as follows:

  • Null hypothesis : \(H_{0}\) : the survival curves are equal (the proportion of events among subject at risk is the same in the two groups)

  • Alternative hypothesis : \(H_{1}\) : the survival curves are not equal

In our case, at each time point the information can be summarized as follows. Let us say that we have two groups: A (women) and B (men).

Group         Deaths at \(T_{i}\)    Alive after \(T_{i}\)         At risk just before \(T_{i}\)
Sex = Woman   \(d_{A_{i}}\)          \(Y_{A_{i}} - d_{A_{i}}\)     \(Y_{A_{i}}\)
Sex = Man     \(d_{B_{i}}\)          \(Y_{B_{i}} - d_{B_{i}}\)     \(Y_{B_{i}}\)
Total         \(d_{i}\)              \(Y_{i} - d_{i}\)             \(Y_{i}\)

Let us introduce the random variable \(D_{A_{i}}\), which counts the number of events in group A at time \(T_{i}\).
\(D_{A_{i}}\) follows a hypergeometric distribution: \(D_{A_{i}} \sim H(Y_{i}=Y_{A_{i}}+Y_{B_{i}},\; Y_{A_{i}},\; \frac{d_{i}}{Y_{i}})\)
with \(E(D_{A_{i}})=\frac{Y_{A_{i}} d_{i}}{Y_{i}}\) and
\(V(D_{A_{i}})=\frac{Y_{A_{i}} Y_{B_{i}} d_{i}(Y_{i}-d_{i})}{Y_{i}^{2}(Y_{i}-1)}\).

A hypergeometric distribution can be approximated by a binomial distribution when \(Y_{i}\) is large enough, and a binomial distribution can in turn be approximated by a normal distribution. Thus:

\[Z_{i}= \frac{(D_{A_{i}}-E(D_{A_{i}}))}{\sqrt{Var(D_{A_{i}})}}\]

The statistic \(Z_{i}\) follows a standard normal distribution: \(Z_{i} \sim N(0,1)\)

Then, summing the centred counts over all the \(k\) event times and squaring:

\[Z^{2} = \frac{\left(\sum_{i=1}^{k} (D_{A_{i}}-E(D_{A_{i}}))\right)^{2}}{\sum_{i=1}^{k} Var(D_{A_{i}})}\]

The statistic \(Z^{2}\) follows a chi-squared distribution with 1 degree of freedom (in our two-group case): \(Z^{2} \sim \chi^{2}_{1}\).
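As a minimal sketch (again assuming that the larger value of status codes an event), this statistic can be recomputed by hand and compared with the survdiff() output shown below:

# Sketch: log-rank chi-squared statistic for sex, computed by hand
d  <- cancer[!is.na(cancer$sex), ]
ev <- as.numeric(d$status == max(d$status))       # assumption: larger value = event
num <- den <- 0
for (t in sort(unique(d$time[ev == 1]))) {
  atrisk <- d$time >= t
  Yi  <- sum(atrisk)                              # at risk just before t
  YA  <- sum(atrisk & d$sex == 2)                 # at risk in group A (women)
  di  <- sum(d$time == t & ev == 1)               # deaths at t
  dAi <- sum(d$time == t & ev == 1 & d$sex == 2)  # deaths at t in group A
  num <- num + (dAi - YA * di / Yi)               # observed minus expected
  if (Yi > 1) den <- den + YA * (Yi - YA) * di * (Yi - di) / (Yi^2 * (Yi - 1))
}
num^2 / den                                       # compare with Chisq from survdiff()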

The function used for the log-rank test is survdiff() from the survival package:

# Log-rank test for the equality of the survival curves
logrank_test <- survdiff(Surv(time, status) ~ sex, data = cancer)
print(logrank_test)
## Call:
## survdiff(formula = Surv(time, status) ~ sex, data = cancer)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 138      112     91.6      4.55      10.3
## sex=2  90       53     73.4      5.68      10.3
## 
##  Chisq= 10.3  on 1 degrees of freedom, p= 0.001

Fixing the significance level at 5%, the p-value is lower than 0.05, so the two survival curves are different. The overall survival probability is worse for men than for women.

2.2 Semi-parametric estimators

Semi-parametric estimators allow covariates to be taken into account when estimating the survival function. In the following part, we present one semi-parametric estimator:

  • Cox model (proportional hazards model)

2.2.1 Cox estimator (proportional hazards model)

The Cox proportional hazards estimator of the instantaneous hazard function is often referred to as a semi-parametric estimator due to its mixed nature, combining parametric and nonparametric components.

The Cox survival model, introduced by David R. Cox, is considered semi-parametric because it has two distinct components:

  • Parametric Component: The model includes a parametric term that specifies the functional form of the effect of covariates on the relative hazard. However, it does not impose particular restrictions on the form of the baseline survival function.

  • Nonparametric Component: The baseline instantaneous hazard function (the hazard function for an individual with all covariates equal to zero) is not specified. Instead, it is estimated in a nonparametric manner, allowing the model to adapt to different shapes of the survival function.

It is the most widely used model in survival analysis.
In the case of two-state survival models, it can be defined as follows:
\[\lambda_{i}(t)=\lambda_{0}(t)exp(\beta_{1}X_{1i}+...+\beta_{p}X_{pi})\] This model is called proportional hazards because the ratio of instantaneous hazards is constant over time:
\[\frac{\lambda_{i}(t)}{\lambda_{j}(t)}=exp(\beta_{1}X_{1i}+...+\beta_{p}X_{pi}-\beta_{1}X_{1j}-...-\beta_{p}X_{pj})\]

The model is also log-linear:
\[\ln(\lambda_{i}(t))=\ln(\lambda_{0}(t))+\beta_{1}X_{1i}+...+\beta_{p}X_{pi}\]

The code below verifies the assumptions required to use semi-parametric models.

2.2.1.1 Assumptions

2.2.1.1.1 Proportionality of the hazards

The time-dependent part of the hazard is exactly the same for the two groups; only a multiplicative factor differs.

The hazard ratios for two individuals do not depend on time : \(HR = \frac{\lambda_{i}(t)}{\lambda_{j}(t)} = exp(\beta_{1}(X_{1i}-X_{1j})+...+\beta_{p}(X_{pi}-X_{pj}))\) is independent of time.

Graphically, the two log-hazard curves should look like two parallel lines.

For example, when comparing two hazard curves for two different covariate modalities, an HR of 2 means that the risk of death is twice as high in the first group as in the second.

Here are some ways to check this assumption.

2.2.1.1.1.1 Graphic validation

To make sure that the proportional hazards hypothesis is respected, it is possible to plot the \(ln(-ln(S_{i}(t)))\) curves for two different groups (using the descriptive Kaplan-Meier estimator).

This plot is called log-log survival plot. It comes from the fact that the survival function is equal to (reminder : \(\Lambda(t)=\int_{0}^{t} \lambda(u) du = -\ln(S(t))\)):

\[ \begin{gathered} S(t \mid Z)=\exp \left(-\int_0^t \lambda(u \mid Z) d u\right) \\ \Longrightarrow \hat{S}(t \mid Z)=\exp \left(-\hat{\Lambda}_0(t) \exp \left(\hat{\beta}^{\prime} Z\right)\right)\\ \Longrightarrow ln(-ln(\hat{S}(t \mid Z))) = ln(\hat{\Lambda}_0(t))+ \hat{\beta}^{\prime} Z \end{gathered} \]

It is much easier to notice that two curves differ by a constant shift than by an exponential factor.

If the two curves are parallel, the proportional hazards hypothesis is respected.

# This code checks the proportional hazards assumption for the "Cancer" dataset
nouveau_df <- data.frame(
  time=kaplan_women$time,
  ln_surv = log(kaplan_women$surv),
  ln_ln_surv = log(-log(kaplan_women$surv))
)

nouveau_df2 <- data.frame(
  time=kaplan_men$time,
  ln_surv = log(kaplan_men$surv),
  ln_ln_surv = log(-log(kaplan_men$surv))
)

# Merge the data frames
merged_df <- rbind(
  data.frame(group = "Women", nouveau_df),
  data.frame(group = "Men", nouveau_df2)
)

# Plot the two log-log survival curves
ggplot(merged_df, aes(x = time, y = ln_ln_surv, color = group)) +
  geom_line(linewidth = 1) +
  labs(title = "ln(-ln(surv)) as a function of time", x = "Time", y = "ln(-ln(surv))") +
  scale_y_continuous(limits = c(min(merged_df$ln_ln_surv), max(merged_df$ln_ln_surv)))

Graphical validation can help to check the proportional hazards hypothesis. Here is an example of two curves that are parallel and therefore respect the proportional hazards hypothesis:

Example of proportional hazards

Here is an example of a violation of the proportional hazards hypothesis:

Example of violation of proportional hazards

Graphical validation alone is not powerful enough to validate the proportional hazards hypothesis.
Statistical tests exist that provide a more formal validation of the proportional hazards hypothesis.

2.2.1.1.1.2 Schoenfeld residuals test

The proportional hazards (PH) assumption can be checked using statistical tests and graphical diagnostics based on the scaled Schoenfeld residuals, computed for each covariate included in a Cox regression model fit.

Knowing that \(\hat{\beta}\) is the solution of the maximum partial likelihood equations (explained in the next part): \[ \sum_{i=1}^{n} (x_{i}-E(x_{i}|R(t_{i})))=0\]

Firstly, let’s define the Schoenfeld residuals at time \(T_{i}\) :

\[ r^{\text{schoenfeld}}_{i}=x_{i}-E(x_{i}|R(t_{i}))\] with m covariates in cox model : \(r_{i}=(r_{i,1},...,r_{i,m})\)

where \(x_{i}\) is the covariate vector of the individual experiencing the i-th event and \(R(t_{i})\) is the risk set at time \(t_{i}\).

Basically, the Schoenfeld residuals measure the difference between the covariate value of the individual who experiences the event and its expected value over the risk set. The scaled Schoenfeld residuals are defined as follows: \[ \hat{r}^{\text{scaled}}_{i,m}=\frac{r_{i,m}}{\hat{V}_{m}}\] with \(\hat{V}_{m}\) the variance of \(r_{i,m}\)

To test the time dependent coefficient, for a single covariate \(X_{m}\), the proportional hazards is expanded as :

\[ \beta_{m}(t)=\beta_{m}+\theta_{m}\, g_{m}(t) \] where \(g_{m}(t)\) is a predictable process and \(\theta_{m}\) quantifies the time dependence.

Thus, the scaled Schoenfeld residuals become: \[ \hat{r}^{\text{scaled}}_{i,m}=\frac{r_{i,m}(\beta_{m})}{\hat{V}(\beta_{m},t_{i})}\] with \(\hat{V}(\beta_{m},t_{i})\) the variance of \(r_{i,m}\) at time \(t_{i}\)

The expectation of the scaled Schoenfeld residuals is: \[E(\hat{r}^{\text{scaled}}_{i,m})=\beta_{m}(t_{i})-\beta_{m}\]

Grambsch and Therneau have shown that the variance is stable over time; thus, when plotting the values of \(\hat{r}^{\text{scaled}}_{m}+\hat{\beta}_{m}\) against time, the plot should be flat if the proportional hazards assumption holds. A score test can then be used to test the null hypothesis.

Basically, these scaled Schoenfeld residuals measure the difference between the observed and expected covariate values under the hypothesis of a constant effect, in order to assess whether the effect of the covariates on the hazard is constant over time (a manual sketch is given below).
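As a sketch (assuming a fitted coxph object such as fit_cancer from the next chunk, and relying on the fact that the rows of the Schoenfeld residual matrix are labelled by the event times), the scaled residuals can be extracted and plotted directly:

# Sketch: plot scaled Schoenfeld residuals (+ beta) against time for one covariate
sch   <- residuals(fit_cancer, type = "scaledsch")  # one column per covariate
times <- as.numeric(rownames(sch))                  # event times (row labels)
plot(times, sch[, "ph.karno"] + coef(fit_cancer)["ph.karno"],
     xlab = "Time", ylab = "beta(t) for ph.karno")
abline(h = coef(fit_cancer)["ph.karno"], col = "red")  # flat line expected under PH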

All of this can be checked using the function cox.zph() from the survival package.

The statistic of the test is defined as follows :

  • Null hypothesis \(H_{0}\): the scaled Schoenfeld residuals are not time dependent (\(\beta_{m}(t_{i})=\beta_{m}\))

  • Alternative hypothesis \(H_{1}\): the scaled Schoenfeld residuals are time dependent

The proportional hazards hypothesis is supported by a non-significant relationship between residuals and time, and refuted by a significant relationship.

Thus, a p-value below 0.05 indicates a violation of the proportional hazards hypothesis.

# This code checks the proportional hazards assumption for the "Cancer" dataset: ph.karno + meal.cal + pat.karno removed later because of model violation
fit_cancer <- coxph(Surv(time, status) ~ sex + age + ph.ecog +  wt.loss + ph.karno + pat.karno + meal.cal, data = cancer)
temp_cancer <- cox.zph(fit_cancer)
print(temp_cancer)
##             chisq df     p
## sex        1.2313  1 0.267
## age        0.0317  1 0.859
## ph.ecog    2.7197  1 0.099
## wt.loss    0.0553  1 0.814
## ph.karno   5.6126  1 0.018
## pat.karno  3.2555  1 0.071
## meal.cal   6.2797  1 0.012
## GLOBAL    13.4993  7 0.061
temp_cancer$call
## cox.zph(fit = fit_cancer)
# survminer package
ggcoxzph(temp_cancer[5], xlab = "Time", pval = TRUE, ggtheme = theme_bw())

# survminer package
ggcoxzph(temp_cancer[7], xlab = "Time", pval = TRUE, ggtheme = theme_bw())

The plot gives an estimate of the time-dependent coefficient \(\beta(t)\). If the proportional hazards assumption holds then the true \(\beta(t)\) function would be a horizontal line.
Grambsch, P. M., and Therneau, T. M. have shown that if the regression coefficients \(\beta\) of the Cox proportional hazards model do not change with time, then the mean of the scaled Schoenfeld Residuals is zero.

Thus a low p-value indicates:

  • the Schoenfeld residuals are not constant over time

  • there is evidence that the variable/predictor may be time-dependent

Thus the proportional hazards assumption (made when fitting the coxph model) may be violated for this variable.

In our example, the variables meal.cal and ph.karno have p-values below 5%, so the proportional hazards hypothesis is violated for them. To resolve this problem, we can consider a time-dependent covariate:

Violations of the proportional hazards assumption can be resolved by:

  • Adding a covariate × time interaction: the form of the interaction has to be chosen by analysing the type of time dependence

  • Stratification: splitting the population into subgroups according to the covariate

  • Partitioning the time axis: the time axis is divided into several intervals and the Cox model is applied to each interval

  • Using a different model, such as an accelerated failure time model or an additive hazards model

2.2.1.1.1.3 Considering a time-dependent covariate

To test the proportional hazards hypothesis, it is also possible to include a time-dependent covariate in the model.
The hazard ratio for two groups then becomes a function of time: \(\frac{\lambda_{i}(t)}{\lambda_{j}(t)}=exp(\beta_{1}(Z_{1i}-Z_{1j}))\, t^{\gamma(Z_{1i}-Z_{1j})}\) with \(Z\) the covariates. The proportional hazards hypothesis is respected if \(\gamma=0\).

In our case, we will consider a time-dependent covariate for the variable meal.cal (linear effect in time) and display the summaries of the two models (one with the interaction and one without).

# Proportional hazards test with a time interaction
fit_cancer_time <- coxph(Surv(time, status) ~ sex + age + ph.ecog +  wt.loss + ph.karno + pat.karno + tt(meal.cal) , data = cancer,
 tt=function(x,t,...) x*t)

summary(fit_cancer_time)
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + ph.ecog + wt.loss + 
##     ph.karno + pat.karno + tt(meal.cal), data = cancer, tt = function(x, 
##     t, ...) x * t)
## 
##   n= 168, number of events= 121 
##    (60 observations deleted due to missingness)
## 
##                    coef  exp(coef)   se(coef)      z Pr(>|z|)    
## sex          -5.379e-01  5.840e-01  2.002e-01 -2.687 0.007206 ** 
## age           1.316e-02  1.013e+00  1.160e-02  1.134 0.256866    
## ph.ecog       7.637e-01  2.146e+00  2.259e-01  3.381 0.000722 ***
## wt.loss      -1.560e-02  9.845e-01  7.613e-03 -2.049 0.040420 *  
## ph.karno      2.248e-02  1.023e+00  1.144e-02  1.966 0.049310 *  
## pat.karno    -1.410e-02  9.860e-01  8.003e-03 -1.762 0.078084 .  
## tt(meal.cal)  1.347e-06  1.000e+00  7.488e-07  1.799 0.072011 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##              exp(coef) exp(-coef) lower .95 upper .95
## sex             0.5840     1.7124    0.3945    0.8645
## age             1.0132     0.9869    0.9905    1.0365
## ph.ecog         2.1463     0.4659    1.3785    3.3417
## wt.loss         0.9845     1.0157    0.9699    0.9993
## ph.karno        1.0227     0.9778    1.0001    1.0459
## pat.karno       0.9860     1.0142    0.9707    1.0016
## tt(meal.cal)    1.0000     1.0000    1.0000    1.0000
## 
## Concordance= 0.649  (se = 0.03 )
## Likelihood ratio test= 31.35  on 7 df,   p=5e-05
## Wald test            = 29.87  on 7 df,   p=1e-04
## Score (logrank) test = 30.57  on 7 df,   p=7e-05
summary(fit_cancer)
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + ph.ecog + wt.loss + 
##     ph.karno + pat.karno + meal.cal, data = cancer)
## 
##   n= 168, number of events= 121 
##    (60 observations deleted due to missingness)
## 
##                 coef  exp(coef)   se(coef)      z Pr(>|z|)   
## sex       -5.509e-01  5.765e-01  2.008e-01 -2.743  0.00609 **
## age        1.065e-02  1.011e+00  1.161e-02  0.917  0.35906   
## ph.ecog    7.342e-01  2.084e+00  2.233e-01  3.288  0.00101 **
## wt.loss   -1.433e-02  9.858e-01  7.771e-03 -1.844  0.06518 . 
## ph.karno   2.246e-02  1.023e+00  1.124e-02  1.998  0.04574 * 
## pat.karno -1.242e-02  9.877e-01  8.054e-03 -1.542  0.12316   
## meal.cal   3.329e-05  1.000e+00  2.595e-04  0.128  0.89791   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## sex          0.5765     1.7347    0.3889    0.8545
## age          1.0107     0.9894    0.9880    1.0340
## ph.ecog      2.0838     0.4799    1.3452    3.2277
## wt.loss      0.9858     1.0144    0.9709    1.0009
## ph.karno     1.0227     0.9778    1.0004    1.0455
## pat.karno    0.9877     1.0125    0.9722    1.0034
## meal.cal     1.0000     1.0000    0.9995    1.0005
## 
## Concordance= 0.651  (se = 0.029 )
## Likelihood ratio test= 28.33  on 7 df,   p=2e-04
## Wald test            = 27.58  on 7 df,   p=3e-04
## Score (logrank) test = 28.41  on 7 df,   p=2e-04

Using a likelihood ratio (LR) test formed by comparing the log-likelihoods of the two models, the PH assumption can be checked.

The statistic of the test is defined as follows :

  • Null hypothesis : \(H_{0}\) : There is no significant difference between the two models.

  • Alternative hypothesis: \(H_{1}\): There is a significant difference between the two models, suggesting that one model fits the survival data significantly better than the other.

The proportional hazards hypothesis is supported when the test is non-significant and refuted when it is significant.

# Assuming cox_model1 and cox_model2 are your Cox models
loglik1 <- logLik(fit_cancer)
loglik2 <- logLik(fit_cancer_time)

# Calculate the likelihood ratio test statistic
lr_test_stat <- -2 * (loglik1 - loglik2)

# Calculate the degrees of freedom
df <- attr(loglik2, "df")

# Perform the likelihood ratio test
p_value <- pchisq(lr_test_stat, df, lower.tail = FALSE)

# Print the results
cat("Likelihood Ratio Test:\n")
## Likelihood Ratio Test:
cat("Test Statistic:", lr_test_stat, "\n")
## Test Statistic: 3.021548
cat("Degrees of Freedom:", df, "\n")
## Degrees of Freedom: 7
cat("P-value:", p_value, "\n")
## P-value: 0.8830017

In our case, adding the time-dependent covariate does not seem worthwhile: the test does not reject the null hypothesis and the fit is not better.

We could then consider stratifying the population according to the covariates, i.e. creating categories for the two variables (meal.cal and ph.karno).

In our case, we simply remove the two covariates:

# remove the variables that do not respect the proportional hazards assumption
fit_cancer <- coxph(Surv(time, status) ~ sex + age + ph.ecog +  wt.loss + pat.karno , data = cancer,x=TRUE)

print(fit_cancer)
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + ph.ecog + wt.loss + 
##     pat.karno, data = cancer, x = TRUE)
## 
##                coef exp(coef)  se(coef)      z       p
## sex       -0.577950  0.561048  0.177127 -3.263 0.00110
## age        0.010714  1.010772  0.009705  1.104 0.26959
## ph.ecog    0.404113  1.497974  0.144410  2.798 0.00514
## wt.loss   -0.012090  0.987983  0.006901 -1.752 0.07979
## pat.karno -0.012371  0.987705  0.007260 -1.704 0.08837
## 
## Likelihood ratio test=33.11  on 5 df, p=3.581e-06
## n= 210, number of events= 148 
##    (18 observations deleted due to missingness)
#plot(temp_cancer)
2.2.1.1.2 Log-linearity of covariates

Covariates have a multiplicative effect on the instantaneous hazard. The model assumes that each continuous variable has a linear effect on the log hazard; this assumption should be checked.

Plotting the Martingale residuals against continuous covariates is a common approach used to detect nonlinearity or, in other words, to assess the functional form of a covariate.

Martingale residuals are computed for each individual and defined as follows :

\[ r^{\text{martingale}}_{i} = \delta_i-\hat{\Lambda}\left(T_i\right) =\delta_i-\hat{\Lambda}_0\left(T_i\right) \exp \left(\beta^T x_i\right) \] where \(T_{i}\) is the total observation time of subject \(i\) and \(\delta_{i}\) is the censoring indicator.

Martingale residuals take values in \(]-\infty;1]\) for uncensored observations and in \(]-\infty;0]\) for censored observations.

Positive values mean that the patient died sooner than expected (according to the null model : coefficient of covariate at 0); negative values mean that the patient lived longer than expected (or were censored).

ggcoxfunctional(Surv(time, status) ~ age + log(age) + sqrt(age), data = lung)
## Warning: arguments formula is deprecated; will be removed in the next version;
## please use fit instead.

It appears that there is slight nonlinearity here.

2.2.1.2 Partial likelihood

In considering,

  • \(D\) the number of deaths observed among the \(n\) subjects in the study,

  • \(T_1<T_2<\ldots<T_D\), the distinct event (death) times, and \((1), (2), \ldots, (D)\) the indices of the individuals who died at \(T_1, T_2, \ldots, T_D\) respectively,

  • \(Z_i\) the value of individual \(i\)’s covariates

  • \(R\left(T_i\right)\) the set of individuals still at risk at \(T_i^{-}\) (just before \(T_i\) ),

To fix ideas, let us assume that there is only one death at each event time (distinct event times).
The total hazard of an event (death) occurring at \(T_i\) is:

\[ \sum_{j \in R\left(T_i\right)} \lambda_0\left(T_i\right) \exp \left(\beta^{\prime} Z_j\right) . \]

The probability that individual i experiences the event at \(T_i\), given that an event has occurred at \(T_i\), is:

\[ \frac{\lambda_0\left(T_i\right) \exp \left(\beta^{\prime} Z_{(i)}\right)}{\sum_{j \in R\left(T_i\right)} \lambda_0\left(T_i\right) \exp \left(\beta^{\prime} Z_j\right)}=\frac{\exp \left(\beta^{\prime} Z_{(i)}\right)}{\sum_{j \in R\left(T_i\right)} \exp \left(\beta^{\prime} Z_j\right)} . \]

The key point is that this probability depends solely on the parameter \(\beta\) and no longer on the baseline hazard function \(\lambda_{0}(t)\), which is left unspecified in the model (semi-parametric).
Since there is a contribution to the likelihood at each death time, the Cox partial likelihood is defined as the product over death times.

The (partial) total likelihood is therefore:

\[ L_{Cox}(\beta)=\prod_{i=1}^D \frac{\exp \left(\beta^{\prime} Z_{(i)}\right)}{\sum_{j \in R\left(T_i\right)} \exp \left(\beta^{\prime} Z_j\right)} \]

In case of simultaneous events, the partial likelihood is defined as follows:

The previous reasoning assumes distinct event times. In the case of real data, this assumption is not always verified (e.g.: measurements every month or quarter). The probability that individual \(j\) will die in \(T_i\) is \[ p_j=\frac{\exp \left(\beta^{\prime} Z_j\right)}{\sum_{k \in R\left(T_i\right)} \exp \left(\beta^{\prime} Z_k\right)} . \]

In the presence of several events, the “exact” method is to assume that the events occur one after the other. However, the order of events is not known, so all possibilities must be considered. In the case of two subjects \(s_1\) and \(s_2\) with characteristics \(Z_1\) and \(Z_2\) who die in \(T_i\), the exact contribution to the likelihood is \[ \frac{\exp \left(\beta^{\prime} Z_1\right) \exp \left(\beta^{\prime} Z_2\right)}{\sum_{j \in R\left(T_i\right)} \exp \left(\beta^{\prime} Z_j\right) \times \sum_{j \in R\left(T_i\right) \backslash s_1} \exp \left(\beta^{\prime} Z_j\right)}+\frac{\exp \left(\beta^{\prime} Z_1\right) \exp \left(\beta^{\prime} Z_2\right)}{\sum_{j \in R\left(T_i\right)} \exp \left(\beta^{\prime} Z_j\right) \times \sum_{j \in R\left(T_i\right) \backslash s_2} \exp \left(\beta^{\prime} Z_j\right)} \]

The problem with this method is that the computation time becomes very long when there are many simultaneous events. Thus, we most often use the Breslow approximation, which consists in assuming that the contribution of the \(d_i\) events at \(T_i\) is the product of the probabilities \(p_j\) for the units that died at \(T_i\) (i.e. \(\sum_{j \in R\left(T_i\right)} \exp \left(\beta^{\prime} Z_j\right) \approx \sum_{j \in R\left(T_i\right) \backslash k} \exp \left(\beta^{\prime} Z_j\right)\)), \[ L_B\left(T_i\right)=\prod_{\substack{j: \text { units that } \\ \text { died at } T_i}} p_j=\frac{\exp \left(\beta^{\prime}\left(\sum_{\substack{j: \text { units that } \\ \text { died at } T_i}} Z_j\right)\right)}{\left(\sum_{k \in R\left(T_i\right)} \exp \left(\beta^{\prime} Z_k\right)\right)^{d_i}} . \]

The Breslow approximation of the total likelihood is \[ \prod_{i=1}^D L_B\left(T_i\right) \]
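As a minimal sketch (restricted to the sex and age covariates on complete cases, and assuming that the larger value of status codes an event), the Breslow partial likelihood can be written down and maximised numerically, then compared with coxph(), which by default uses the slightly different Efron tie correction:

# Sketch: negative log partial likelihood (Breslow ties) maximised with optim()
d     <- na.omit(cancer[, c("time", "status", "sex", "age")])
X     <- as.matrix(d[, c("sex", "age")])
delta <- as.numeric(d$status == max(d$status))  # 1 = observed event (assumption)

neg_log_pl <- function(beta) {
  eta <- drop(X %*% beta)                       # linear predictor beta'Z
  ll  <- 0
  for (t in unique(d$time[delta == 1])) {
    dead   <- which(d$time == t & delta == 1)   # events at time t
    atrisk <- which(d$time >= t)                # risk set just before t
    ll <- ll + sum(eta[dead]) - length(dead) * log(sum(exp(eta[atrisk])))
  }
  -ll
}

opt <- optim(c(0, 0), neg_log_pl)
opt$par                                               # hand-maximised coefficients
coef(coxph(Surv(time, status) ~ sex + age, data = d)) # close to the coxph estimates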

Before applying the Cox model, and to allow its predictive performance to be evaluated, the data are split into a training set and a test set.

# remove the rows of cancer with missing values
cancer <- na.omit(cancer)
cancer <- cancer[,-1]

trainIndex <- createDataPartition(cancer$status, p = .7)
                                  #list = FALSE, 
                                  #times = 1)
cancer.train <- cancer[ trainIndex$Resample1,]
cancer.test  <- cancer[-trainIndex$Resample1,]

X<- as.matrix(subset(cancer, select = -c(time, status)))
Y<- as.matrix(subset(cancer, select = time))
D <- as.matrix(subset(cancer, select = status))

X.train<- as.matrix(subset(cancer.train, select = -c(time, status)))
Y.train<- as.matrix(subset(cancer.train, select = time))
D.train <- as.matrix(subset(cancer.train, select = status))

X.test<- as.matrix(subset(cancer.test, select = -c(time, status)))
Y.test<- as.matrix(subset(cancer.test, select = time))
D.test <- as.matrix(subset(cancer.test, select = status))
2.2.1.2.1 Maximisation of partial likelihood

The coefficients of the Cox model are estimated by maximizing the partial likelihood of Cox model.

The underlying principle is that the parameter values that maximize the likelihood of the observed data are the most plausible values, given the statistical model chosen.

The maximization of the partial likelihood is calculated instead of the total likelihood to significantly reduce computation time. However, censoring and lifetime must be independent, and censoring must be non-informative.

The coefficients are calculated by the coxph function in R’s survival package (already applied in the chunk above).

set.seed(100)
fit_cancer <- coxph(Surv(time, status) ~ sex + age + ph.ecog +  wt.loss + pat.karno , data = cancer.train ,x=TRUE)
#fit_cancer$var
print(fit_cancer)
## Call:
## coxph(formula = Surv(time, status) ~ sex + age + ph.ecog + wt.loss + 
##     pat.karno, data = cancer.train, x = TRUE)
## 
##                coef exp(coef)  se(coef)      z      p
## sex       -0.468947  0.625661  0.235343 -1.993 0.0463
## age       -0.007602  0.992427  0.014259 -0.533 0.5939
## ph.ecog    0.473832  1.606137  0.190446  2.488 0.0128
## wt.loss   -0.017703  0.982453  0.008763 -2.020 0.0434
## pat.karno -0.005007  0.995005  0.008988 -0.557 0.5775
## 
## Likelihood ratio test=13.64  on 5 df, p=0.01806
## n= 117, number of events= 93

In our case, the Cox coefficients can be seen in the tables displayed above.

The “exp(coef)” column corresponds to the hazard ratio of each covariate. Given the proportional hazards assumption, the hazard ratio is constant.

Thus, “exp(coef)” corresponds to the hazard ratio of each covariate :

  • For a binary variable coded 0/1: \(HR=exp(coef)\),

  • For a binary variable coded a/b: \(HR=exp(coef*(b-a))\),

  • For a continuous variable, \(exp(coef)\) corresponds to the hazard ratio for a one-unit increase in the variable. For example, for the “age” variable, each additional year of life multiplies the instantaneous risk of death by a factor of 1.01, an increase of 1%.

The table also displays the 95% confidence interval of the hazard ratio for each covariate.

The table also displays the value of the significance test for a coefficient (z) and the p-value associated with this test.
The p-value must be less than 0.05 to admit the significance of a coefficient.

The likelihood ratio test is also displayed. It involves comparing two models: one without variables (reduced model) and one with them (full model). It is known as the test of nullity of the coefficients.
The p-value must be less than 0.05 to reject that the variable(s) have no effect on survival.

Thus, for the Cox model of the “cancer” database, the coefficients for sex, ph.ecog and wt.loss are significantly different from 0. The hazard ratio for the sex variable is 0.6: women appear to have better survival than men (a reduction of the instantaneous risk of death of about 40%).

The hazard ratio of the ph.ecog variable is 1.9. The ECOG score is an indicator of the patient’s physical performance. An increase of 1 in the ECOG score increases the instantaneous risk of death by approximately 90%.

The hazard ratio of the wt.loss variable is 0.97: an increase of 1 pound in weight loss decreases the instantaneous risk of death by about 3%.
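As a small sketch, the hazard ratios and their 95% confidence intervals discussed above can be extracted directly from the fitted model:

# Sketch: hazard ratios and Wald 95% confidence intervals from the fitted Cox model
hr <- exp(coef(fit_cancer))
ci <- exp(confint(fit_cancer))
round(cbind(HR = hr, ci), 3)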

2.2.1.3 Prediction of survival probability

After fitting the Cox model, visualization of the survival curve for a binary survival model is straightforward: survfit() evaluates the curve with the covariates set to their respective means, and the ggsurvplot function of the survminer package displays it.

ggsurvplot(survfit(fit_cancer,data=cancer.test), palette = "#2E9FDF",ggtheme = theme_minimal())

The error in this graph is the standard error of the survival curve computed with the variance of the coefficient estimator.

You can also create a new data frame to set the covariates to the values you want. For example, you can set sex to 1 (male) or 2 (female) and age to the average age.

# Create a new data frame to fix the covariates at the desired values
sex_df <- with(cancer,
               data.frame(sex = c(1, 2), 
                          age = rep(mean(age, na.rm = TRUE), 2),
                          ph.ecog = c(1, 1),
                          wt.loss = c(0, 0),
                          pat.karno = c(100, 100)
                          )
               )
sex_df
##   sex      age ph.ecog wt.loss pat.karno
## 1   1 62.56886       1       0       100
## 2   2 62.56886       1       0       100
# Survival curves for the two covariate profiles
fit <- survfit(fit_cancer, newdata = sex_df)
ggsurvplot(fit, data=sex_df,conf.int = TRUE, legend.labs=c("Sex=Male", "Sex=Female"),
           ggtheme = theme_minimal())
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## ℹ Please use `gather()` instead.
## ℹ The deprecated feature was likely used in the survminer package.
##   Please report the issue at <https://github.com/kassambara/survminer/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

2.2.1.4 Evaluation of prediction performance

To evaluate the prediction performance of a Cox model, a visual validation against the Kaplan-Meier estimator can be done and, specifically for the Cox model, the cumulative/dynamic AUC can be computed (the only AUC-type metric implemented here).

2.2.1.4.1 Visual validation

One option is to average the Cox survival curves over all individuals and compare this mean curve with the Kaplan-Meier survival curve. Since the Kaplan-Meier estimator is descriptive (marginal), the mean of the individual Cox curves should be close to the KM curve.
If the two curves are close, the model is good.

# This code visually checks the performance of the Cox model on the "Cancer" dataset

# Compute the mean survival curve over all individuals
fit <- survfit(fit_cancer, newdata = cancer)
#mean_survival <- apply(fit, 2, mean)

#fit$time
mean_survival<-apply(fit$surv,1,mean)

kaplan <- survfit(Surv(time, status) ~ 1, data = cancer)
#kaplan$surv
# Mean survival curve
ggsurvplot(kaplan, data=cancer.test,conf.int = TRUE,
           ggtheme = theme_minimal())

plot(NA, NA, xlab = "failure time", ylab = "survival function",
     xlim = range(kaplan$time),
     ylim = c(0, 1))
lines(fit$time, mean_survival, col = "red",lwd=3)
lines(kaplan$time, kaplan$surv, col = "blue",lwd=3)

2.2.1.4.2 Cumulative/Dynamic AUC

Sensitivity, specificity, and area under the ROC curve (AUC) are often used to measure the ability of survival models to predict future risk.

Cumulative cases are all individuals that experienced an event prior to or at time \(t\left(t_i \leq t\right)\), whereas dynamic controls are those with \(t_i>t\). The associated cumulative/dynamic AUC quantifies how well a model can distinguish subjects who fail by a given time \(\left(t_i \leq t\right)\) from subjects who fail after this time \(\left(t_i>t\right)\).

Given an estimator of the \(i\)-th individual’s risk score \(\hat{f}\left(\mathbf{x}_i\right)\), the cumulative/dynamic AUC at time \(t\) is defined as \[ \widehat{\mathrm{AUC}}(t)=\frac{\sum_{i=1}^n \sum_{j=1}^n I\left(y_j>t\right) I\left(y_i \leq t\right) \omega_i I\left(\hat{f}\left(\mathbf{x}_j\right) \leq \hat{f}\left(\mathbf{x}_i\right)\right)}{\left(\sum_{i=1}^n I\left(y_i>t\right)\right)\left(\sum_{i=1}^n I\left(y_i \leq t\right) \omega_i\right)} \] where \(\omega_i\) are inverse probability of censoring weights (IPCW).

train.fit <- coxph(Surv(time, status) ~ sex + age + ph.ecog +  wt.loss + pat.karno , data = cancer.train,x=TRUE,y=TRUE)

lp <- predict(train.fit)
lpnew <- predict(train.fit, newdata=cancer.test)

Surv.rsp <- survival::Surv(cancer.train$time, cancer.train$status)
Surv.rsp.new <- survival::Surv(cancer.test$time, cancer.test$status)

times <- seq(20, 1000, 10)
AUC_cd.cox <- AUC.cd(Surv.rsp, Surv.rsp.new,lp, lpnew, times)

plot(AUC_cd.cox)
abline(h = 0.5, col = "blue")

The function AUC.cd() calculates the cumulative/dynamic AUC for a Cox model, but other functions implement variants: AUC.hc, AUC.sh, AUC.uno.

This technique could be applied to other models but is not implemented for them here.

2.3 Survival random forest

Survival forests are non-parametric estimators of the survival function. Random survival forests (RSF) were introduced to extend random forests (RF) to the setting of right-censored survival data. The influence of covariates is integrated into this estimator.

The survival forest is a set of survival trees.
It is adapted to right censoring and is used to estimate the conditional survival function \(S(t,x)=P(T>t \mid X=x)\).

The random survival forest algorithm can be detailed as follows:

1 - Draw a bootstrap sample from the database

2 - Build a survival tree on each of the bootstrap samples using randomly drawn covariates:

  • At each node, select a number of covariates at random

  • Select the best split that maximizes the survival difference between the two nodes (by using metrics such as log-rank splitting that splits nodes by maximization of the log-rank test statistic, gradient based brier score splitting or log-rank score splitting).

3 - Continue building the tree until each terminal node (leaf) contains a minimum number of observations.

4 - Calculate the cumulative hazard function for each tree

5 - Average to obtain the ensemble CHF.

6 - Using OOB data, calculate prediction error for the ensemble CHF

Then, to obtain the aggregate, these individual estimates of the cumulative hazard function from different trees are averaged. This gives an aggregated estimate of cumulative risk for the model as a whole.

It is possible to train the forest on all the data and estimate the error with the OOB (out of bag) data, or to separate the data into two parts: Train and Test, and then evaluate the prediction on the test data.

Survival random forest

The C-index (concordance index), often used to quantify prediction error, can be explained simply: for a pair of individuals, if one has a higher predicted risk score than the other, then the individual with the higher risk score should experience the event sooner (a minimal sketch follows).
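As a minimal sketch of this definition (not the exact implementation used by the packages below; functions such as survival::concordance() do this for us), Harrell's C-index can be computed by looping over comparable pairs:

# Sketch: Harrell's C-index for right-censored data, computed over comparable pairs
# (a pair is comparable only when the shorter follow-up time is an observed event)
cindex_by_hand <- function(time, status, risk) {  # risk: higher = worse prognosis
  conc <- comp <- 0
  n <- length(time)
  for (i in 1:(n - 1)) for (j in (i + 1):n) {
    a <- if (time[i] <= time[j]) i else j         # individual with the shorter time
    b <- if (time[i] <= time[j]) j else i
    if (status[a] == 1 && time[a] < time[b]) {    # comparable pair
      comp <- comp + 1
      if (risk[a] > risk[b]) conc <- conc + 1     # concordant pair
      else if (risk[a] == risk[b]) conc <- conc + 0.5
    }
  }
  conc / comp                                     # 1 = perfect, 0.5 = random
}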

2.3.1 Package GRF

The grf package lets you build a survival random forest. The forest is easily constructed using the survival_forest function.

Here are the default parameters:

  • The default number of trees is 1,000,

  • The number of randomly drawn covariates is \(min(\sqrt{p}+20,p)\)

  • The minimum number of observations in each leaf is set at 15. This minimum number must be greater than 1, since we’re trying to estimate a probability.
    Expanding the trees as much as possible tends to give extreme conditional probability results (0 or 1), which is what we want to avoid.

The same parameters are used for the RFSRC model below.

# This code computes the survival forest estimator for the "Cancer" dataset
# train a survival forest
forest_cancer.grf<-survival_forest(X.train, Y.train, D.train)
plot(get_tree(forest_cancer.grf, 3))
# prediction on test set
s.pred.grf<- predict(forest_cancer.grf, X.test,estimate.variance=TRUE)

# Plot the predicted survival curves for two individuals of the test set

plot(NA, NA, xlab = "failure time", ylab = "survival function",
     xlim = range(s.pred.grf$failure.times),
     ylim = c(0, 1))
lines(s.pred.grf$failure.times, s.pred.grf$predictions[2,], col = "red",lwd=3)
lines(s.pred.grf$failure.times, s.pred.grf$predictions[1,], col = "blue",lwd=3)

print(forest_cancer.grf)
## GRF forest object of type survival_forest 
## Number of trees: 1000 
## Number of training samples: 117 
## Variable importance: 
##     1     2     3     4     5     6     7 
## 0.217 0.032 0.075 0.108 0.140 0.232 0.195

“prediction.type” corresponds to the type of estimate of the survival function; it can be “Kaplan-Meier” (survival function) or “Nelson-Aalen” (cumulative hazard function).

The importance of variables is calculated using the variable_importance function of the grf package. It counts the number of times each variable is used to construct a split at each depth of a tree.
The counts are then divided by the total number of splits at each depth (giving a split frequency per depth). The deeper the split, the less it contributes to the importance of the variable.
So we weight the split frequencies by an exponential decay (\(\frac{1}{\lambda^{decay.exponent}}\) with \(\lambda\) the depth of the split), then take the matrix product of the transposed matrix of averaged split frequencies with the vector of weights, divided by the sum of the weights (a weighted average of the split frequencies).

The variable 1 corresponds to the age, the variable 2 to sex, the variable 3 to ph.ecog, the variable 4 to ph.karno, the variable 5 to pat.karno, the variable 6 to meal.cal and the variable 7 to wt.loss.

Here, the most important variable seems to be meal.cal, followed by age and wt.loss.

decay.exponent<-2
split.freq<-split_frequencies(forest_cancer.grf,  max.depth = 4)
split.freq2<-split.freq / pmax(1L, rowSums(split.freq))
weight <- seq_len(nrow(split.freq2))^-decay.exponent
weight
## [1] 1.0000000 0.2500000 0.1111111 0.0625000
t(split.freq2) %*% weight / sum(weight)
##            [,1]
## [1,] 0.21737945
## [2,] 0.03160656
## [3,] 0.07497640
## [4,] 0.10840035
## [5,] 0.13984204
## [6,] 0.23245063
## [7,] 0.19534457
# compute the variable importance of the model
feat_imp_df<-variable_importance(forest_cancer.grf)

# it can also be extracted from the trained random survival forest
#feat_imp_df<-forest_cancer$variable.importance 

# plot the variable importance
# add the variable names
feat_imp_df<-cbind(variable=colnames(X.test),feature=feat_imp_df)
colnames(feat_imp_df)<-c("variable","values")
feat_imp_df<-as.data.frame(feat_imp_df)
class(feat_imp_df$values)
## [1] "character"
feat_imp_df$values<-as.numeric(feat_imp_df$values)
feat_imp_df <- feat_imp_df[order(feat_imp_df$values, decreasing=TRUE),]
feat_imp_df$values<-round(feat_imp_df$values,2)

ggplot(feat_imp_df, aes(x = reorder(variable, values), y = values)) +
    geom_bar(stat='identity') +
    coord_flip() +
    theme_classic() +
    labs(
      x     = "Feature",
      y     = "Importance",
      title = "Feature Importance: Survival random forest"
    )+
    #scale_x_continuous(limits=c(0,1))+
    theme(axis.text.x= element_text(angle = 45, hjust = 1, size = 20),  # adjust the text size here
    plot.title = element_text(hjust = 0.5))+
    theme(axis.text.y= element_text(angle = 0, hjust = 1, size = 20))  # adjust the text size here

2.3.2 Package randomForestSRC

# This code fits a random survival forest with the randomForestSRC package
# on the "Cancer" training data

# NAs were removed earlier to allow comparison with the other models,
# which do not support missing data
#splitrule default = logrank (can use bs.gradient or logrankscore)
forest_cancer.rfsrc<-rfsrc(Surv(time, status) ~ age+sex+ph.ecog+ph.karno+pat.karno+meal.cal+wt.loss , 
data = cancer.train, ntree=1000, mtry=7, splitrule = "logrank",
nodesize=15,forest=TRUE)


## plot tree number 3
plot(get.tree(forest_cancer.rfsrc, 3,surv.type="surv"))
print(forest_cancer.rfsrc)
##                          Sample size: 117
##                     Number of deaths: 93
##                      Number of trees: 1000
##            Forest terminal node size: 15
##        Average no. of terminal nodes: 6.765
## No. of variables tried at each split: 7
##               Total no. of variables: 7
##        Resampling used to grow trees: swor
##     Resample size used to grow trees: 74
##                             Analysis: RSF
##                               Family: surv
##                       Splitting rule: logrank *random*
##        Number of random split points: 10
##                           (OOB) CRPS: 0.15581095
##    (OOB) Requested performance error: 0.45269904

The tree is built using the logrank splitting rule also. The quantities below each node are the number of individuals at risk and the predicted survival probability at the default time (median follow up).

## plot survival curves for first 10 individuals -- direct way
matplot(forest_cancer.rfsrc$time.interest, 100 * t(forest_cancer.rfsrc$survival.oob[1:10, ]),
xlab = "Time", ylab = "Survival", type = "l", lty = 1)

From about 50 trees onwards, the prediction error of the survival random forest is stable.

2.4 Validation and selection of the model

We need to be able to select a model from all the models we’ve calculated. There are several ways of doing this:

  • Performance metrics :

    • Concordance index (see the sketch after this list)

    • Brier score

    • Log-rank test

  • Graph:

    • In the case of conditional models: average the survival curves over all individuals and compare with the marginal estimator (Kaplan-Meier)
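As a minimal sketch of the first metric (using survival::concordance, one of several possible implementations), the concordance of the Cox model fitted above can be evaluated on the test set:

# Sketch: concordance index of the Cox model evaluated on the held-out test set
concordance(fit_cancer, newdata = cancer.test)  # closer to 1 = better discrimination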

2.4.1 Visual

# Visualisation of the survival curves for the three models
# We use the newdf data frames to plot the models with covariates
newdf <- with(cancer,
               data.frame(sex =  mean(sex, na.rm = TRUE), 
                          age = mean(age, na.rm = TRUE),
                          ph.ecog =  mean(ph.ecog, na.rm = TRUE),
                          wt.loss =  mean(wt.loss, na.rm = TRUE),
                          pat.karno =   mean(pat.karno, na.rm = TRUE)
                          )
               )
newdf
##        sex      age   ph.ecog  wt.loss pat.karno
## 1 1.383234 62.56886 0.9580838 9.718563  79.58084
newdf2 <- with(cancer,
               data.frame(age = mean(age, na.rm = TRUE),
                          sex = 1, 
                          ph.ecog = 1,
                          ph.karno = 100,
                          pat.karno =  100,
                          meal.cal = 1000,
                          wt.loss = 0                          
                          )
               )
newdf2
##        age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 62.56886   1       1      100       100     1000       0
# Cox survival curve for the covariate values fixed above
fit <- survfit(fit_cancer, newdata = newdf2)

# Survival curve for the random survival forest (grf)
s.pred.grf<- predict(forest_cancer.grf, newdata = newdf2, prediction.type = "Kaplan-Meier")

s.pred.grf
## $predictions
##           [,1]      [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
## [1,] 0.9875585 0.9861678 0.984117 0.9822361 0.9797122 0.9496857 0.9455752
##           [,8]      [,9]     [,10]     [,11]     [,12]     [,13]     [,14]
## [1,] 0.9414756 0.9232545 0.9221556 0.9166382 0.9076524 0.8956323 0.8909124
##          [,15]     [,16]     [,17]    [,18]     [,19]     [,20]     [,21]
## [1,] 0.8827536 0.8775229 0.8681692 0.863397 0.8499184 0.8348347 0.8280847
##          [,22]     [,23]     [,24]     [,25]     [,26]     [,27]     [,28]
## [1,] 0.8107049 0.8059506 0.7889273 0.7710179 0.7678759 0.7593106 0.7501845
##         [,29]     [,30]     [,31]     [,32]     [,33]     [,34]     [,35]
## [1,] 0.748342 0.7389073 0.7226253 0.7177992 0.7137978 0.7104923 0.7032363
##          [,36]     [,37]     [,38]     [,39]     [,40]     [,41]     [,42]
## [1,] 0.6978775 0.6884897 0.6802749 0.6709304 0.6660228 0.6640963 0.6481521
##         [,43]    [,44]     [,45]    [,46]     [,47]     [,48]     [,49]
## [1,] 0.642502 0.628512 0.6148173 0.596163 0.5705246 0.5598355 0.5557108
##          [,50]     [,51]     [,52]     [,53]     [,54]     [,55]     [,56]
## [1,] 0.5480189 0.5357658 0.5319606 0.5269341 0.5244719 0.5053395 0.4968118
##          [,57]     [,58]     [,59]     [,60]     [,61]    [,62]     [,63]
## [1,] 0.4943996 0.4783187 0.4722841 0.4572586 0.4340468 0.423088 0.4143602
##          [,64]     [,65]     [,66]     [,67]     [,68]     [,69]     [,70]
## [1,] 0.3907123 0.3680612 0.3620686 0.3587711 0.3426576 0.3323031 0.3140534
##          [,71]     [,72]     [,73]     [,74]     [,75]     [,76]     [,77]
## [1,] 0.3051585 0.2852882 0.2569796 0.2382302 0.2351069 0.2180263 0.2016186
##          [,78]     [,79]     [,80]     [,81]     [,82]      [,83]      [,84]
## [1,] 0.1934472 0.1802005 0.1557578 0.1421948 0.1124179 0.09813657 0.07949986
##           [,85]      [,86]      [,87]      [,88]
## [1,] 0.06877616 0.05062836 0.03067741 0.02718376
## 
## $failure.times
##  [1]  11  12  13  26  30  53  54  59  60  61  79  81  88  92  95 107 110 118 135
## [20] 142 145 147 153 156 163 166 167 170 176 180 181 183 197 201 210 223 226 229
## [39] 230 239 245 246 267 268 269 283 285 286 291 293 301 303 305 310 320 348 351
## [58] 353 361 363 371 390 426 428 429 433 444 450 455 457 460 473 477 519 524 550
## [77] 558 567 583 641 643 655 689 705 707 731 791 814
plot(NA, NA, xlab = "failure time", ylab = "survival function",
     xlim = range(s.pred.grf$failure.times),
     ylim = c(0, 1))
lines(kaplan2$time, kaplan2$surv, col= "red", lwd=3)
lines(kaplan2$time, kaplan2$lower, col= "red", lty=2)
lines(kaplan2$time, kaplan2$upper, col= "red", lty=2)
lines(fit$time, fit$surv, col= "blue", lwd=3)
lines(fit$time, fit$lower, col= "blue", lty=2)
lines(fit$time, fit$upper, col= "blue", lty=2)
lines(s.pred.grf$failure.times, s.pred.grf$predictions, col = "green",lwd=3)
legend("topright", legend = c("km", "cox", "rfsrc"), fill = c("red","blue","green"))

Comparing models visually in this way is not very efficient, so it is advisable to use performance metrics.

2.4.2 Concordance index (Harrell's C-index)

The concordance index, also known as the c-index or Harrell concordance, is a measure commonly used to evaluate the performance of survival models.

The concordance index measures the model’s ability to correctly rank pairs of individuals. More specifically, it assesses whether, among all pairs of individuals where one had an event before the other, the model assigns a higher risk to the one who had the event earlier.

The concordance index ranges from 0 to 1, where 0.5 indicates random performance (no discriminatory ability), and 1 indicates perfect classification. A c-index greater than 0.5 suggests that the model has better predictive ability than chance:

Concordance index

Concordance index
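
To make the definition concrete, here is a minimal, non-optimized sketch (a hypothetical helper, not taken from any package) that computes Harrell's C by looping over comparable pairs. It assumes observed times, a 0/1 event indicator, and a risk score for which higher values mean higher risk:

# Minimal sketch of Harrell's C-index (O(n^2) loop, for illustration only).
# time: observed times; status: 1 = event, 0 = censored; risk: higher = riskier.
harrell_c <- function(time, status, risk) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (time[i] == time[j]) next                 # skip ties in time for simplicity
      earlier <- if (time[i] < time[j]) i else j   # individual with the shorter observed time
      later   <- if (time[i] < time[j]) j else i
      if (status[earlier] == 1) {                  # comparable only if the earlier time is an event
        comparable <- comparable + 1
        if (risk[earlier] > risk[later])       concordant <- concordant + 1
        else if (risk[earlier] == risk[later]) concordant <- concordant + 0.5
      }
    }
  }
  concordant / comparable
}

# Example with objects defined below in this section: harrell_c(Y.train, D.train, chf.score)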

## COX

# Concordance index for the Cox model
c.cox<-concordance(fit_cancer)


## survival forest GRF

# Concordance index for the GRF model
# predictions from the fitted forest (out-of-bag on the training data)
pred.nelson.aalen <- predict(forest_cancer.grf, prediction.type = "Nelson-Aalen")

# risk score per individual: sum over the time grid of the cumulative hazard (-log of survival)
chf.score <- rowSums(-log(pred.nelson.aalen$predictions))

# compute the concordance index
# reverse = TRUE so that individuals with a higher score are treated as higher risk
# the higher the chf.score, the higher the estimated risk of death
c.grf<-concordance(Surv(Y.train, D.train) ~ chf.score, reverse = TRUE,ranks=TRUE)


## survival forest randomForestSRC

# OOB error rate for the randomForestSRC survival forest
err.rfsrc <- forest_cancer.rfsrc$err.rate[forest_cancer.rfsrc$ntree]

c.rfsrc<-1-err.rfsrc


print(c.cox$concordance)
## [1] 0.6545898
print(c.grf$concordance)
## [1] 0.5144561
print(c.rfsrc)
## [1] 0.547301
a<- seq(1,501,5)

plot(NA, NA, xlab = "n.trees", ylab = "C-index : 1-OOB error rate",
     xlim = c(0,500),
     ylim = c(0, 1))
# loop that increases the number of trees and computes, for each forest size,
# the out-of-bag C-index for both packages
for (k in a){
  oob.rfsrc<-rfsrc(Surv(time, status) ~ age+sex+ph.ecog+ph.karno+pat.karno+meal.cal+wt.loss , 
  data = cancer.train, ntree=k, mtry=7,
  nodesize=15,forest=TRUE)
  err<-1-oob.rfsrc$err.rate[k]
  points(k,err,col="red")
  
  oob.grf <- survival_forest(X.train, Y.train, D.train, num.trees=k)
  pred.oob.grf <- predict(oob.grf, prediction.type = "Nelson-Aalen")
  chf.score <- rowSums(-log(pred.oob.grf$predictions))
  c<-concordance(Surv(Y.train, D.train) ~ chf.score, reverse = TRUE)
  c.index<-c$concordance
  points(k,c.index,col="blue")
}
legend("topright", legend = c("C-index : RFSRC", "C-index : GRF"), fill = c("red","blue"))

2.4.3 Brier score

The Brier Score is a mathematical way to assess the accuracy of probabilistic predictions.
The formula for the Brier Score is:

\[ \mathrm{BS}\left(t^*\right)=\frac{1}{n} \sum_{i=1}^n\left(I\left(T_i>t^*\right)-\hat{\pi}\left(t^* \mid \tilde{X}_i\right)\right)^2 \]

Here:

  • \(BS\) is the Brier Score.

  • \(n\) is the number of individuals.

  • \(\hat{\pi}\left(t^* \mid \tilde{X}_i\right)\) is the predicted survival probability at time \(t^*\) for individual \(i\).

  • \(I\left(T_i>t^*\right)\) is the observed status at time \(t^*\) for individual \(i\) (1 if still alive at \(t^*\), 0 otherwise).

In the case of right censoring, the Brier score has to incorporate inverse probability of censoring weights (IPCW), which can be estimated with the Kaplan-Meier estimator of the censoring distribution.

In simple terms, the Brier Score measures the mean squared difference between predicted probabilities and actual outcomes. A lower Brier Score indicates better predictive accuracy, with 0 being a perfect score.

# COX
BrierScore.cox <- predErr(Surv.rsp, Surv.rsp.new, lp, lpnew, times,
type = "brier", int.type = "weighted")
plot(BrierScore.cox)
abline(h = 0.25, col = "blue")


# RFSRC
rfsrc <- get.brier.survival(forest_cancer.rfsrc, cens.model = "km")$brier.score
lines(rfsrc,col = "green")

legend("topright", legend = c("cox", "rfsrc"), fill = c("red","green"))

# GRF
#rfsrc <- get.brier.survival(forest_cancer.grf, cens.model = "km")$brier.score
#lines(rfsrc,col = "green")

There is no built-in Brier score for grf survival forests, but one can be implemented by hand, as sketched below.
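
A minimal IPCW Brier score sketch at a single time point could look as follows. This is my own illustration, not part of the grf API; it assumes survival-curve predictions in the grf format (a matrix of survival probabilities with one row per individual, plus the corresponding failure.times grid) and a 0/1 event indicator such as D.train:

# Minimal IPCW Brier score at a single time t.star (illustration only).
# surv.pred: matrix of predicted S(t), one row per individual, one column per grid time
# fail.times: time grid of the predictions; time/status: observed data (status: 1 = event)
ipcw_brier <- function(t.star, surv.pred, fail.times, time, status) {
  # Kaplan-Meier estimate of the censoring distribution G(t)
  G <- survfit(Surv(time, 1 - status) ~ 1)
  G.fun <- stepfun(G$time, c(1, G$surv))           # right-continuous step function for G
  # predicted survival at t.star (last grid point at or before t.star;
  # assumes t.star is at least the first grid time)
  S.hat <- surv.pred[, max(which(fail.times <= t.star))]
  # IPCW weights: events before t.star get 1/G(T_i), subjects still at risk get 1/G(t.star),
  # observations censored before t.star get weight 0
  w <- ifelse(time <= t.star & status == 1, 1 / G.fun(time),
              ifelse(time > t.star, 1 / G.fun(t.star), 0))
  mean(w * ((time > t.star) - S.hat)^2)
}

# Example (names from earlier sections, assuming predictions on the training data):
# pred.grf <- predict(forest_cancer.grf)   # default: Kaplan-Meier-type survival curves
# ipcw_brier(365, pred.grf$predictions, pred.grf$failure.times, Y.train, D.train)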

3 Conclusion

In this R notebook, we explored three different types of estimators: non-parametric, semi-parametric, and parametric. Each estimator provides unique insights into the survival patterns of the dataset.

The Kaplan-Meier estimator, a non-parametric method, allowed us to estimate the survival function without making assumptions about the underlying distribution. This method is particularly useful for capturing the overall trend of survival probabilities over time.

Following this approach, the Survival Random Forest was introduced as a non-parametric extension, harnessing the power of random forests to effectively handle heterogeneous data while preserving robustness against censoring.

The Cox Proportional-Hazards model, a semi-parametric approach, introduced covariates to the analysis, enabling us to assess the impact of various factors on survival. The proportional hazards assumption was evaluated using the Schoenfeld residuals, ensuring the validity of our model.

Additionally, we explored parametric survival models, such as the Weibull distribution, which assume a specific functional form for the hazard function. This yields smooth, fully specified estimates, but it requires the assumption of a particular distribution.

Evaluation metrics such as AUC, C-index and Brier score were employed to compare the goodness-of-fit of different models.

Verification of assumptions is crucial in survival analysis. The log-rank test, Cramer-von Mises statistic, and Anderson-Darling statistic were employed to assess the goodness-of-fit of non-parametric models. The proportional hazards assumption in the Cox model was verified using the Schoenfeld residuals.

In conclusion, this analysis provides a comprehensive understanding of survival patterns in the dataset. The choice between non-parametric, semi-parametric, or parametric models depends on the characteristics of the data and the assumptions one is willing to make. Rigorous evaluation and verification of assumptions are essential for ensuring the reliability of the survival analysis results.

4 Annex

4.1 Parametric estimators

In the case of parametric estimators, the hazard (instantaneous risk) and survival functions are defined by explicit mathematical functions. For simple (two-state) survival models, the Weibull distribution (monotonic hazard) and the exponential distribution (constant hazard) are the most commonly used.
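
For reference, with the shape/scale parameterization used by fitdistrplus (shape \(k\), scale \(\lambda\)), the Weibull survival and hazard functions are

\[ S(t)=\exp\left(-\left(\frac{t}{\lambda}\right)^{k}\right), \qquad h(t)=\frac{k}{\lambda}\left(\frac{t}{\lambda}\right)^{k-1} \]

The exponential distribution is the special case \(k=1\), with constant hazard \(1/\lambda\).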

Several packages are available for parametric analysis:

The fitdistrplus package enables maximum-likelihood fitting of parametric distributions for the simple (two-state) survival model.

# This code fits the Weibull parametric estimator to the "Cancer" data set
# Install the packages if not already done
# install.packages(c("survival", "fitdistrplus"))


# Fit Weibull, gamma and log-normal distributions with fitdistrplus
fitW <- fitdist(cancer$time, "weibull")
fitg <- fitdist(cancer$time, "gamma")
fitln <- fitdist(cancer$time, "lnorm")
plot(fitW)

summary(fitW)
## Fitting of the distribution ' weibull ' by maximum likelihood 
## Parameters : 
##         estimate  Std. Error
## shape   1.494423  0.09039443
## scale 342.564506 18.64947299
## Loglikelihood:  -1106.894   AIC:  2217.788   BIC:  2224.024 
## Correlation matrix:
##           shape     scale
## shape 1.0000000 0.3080232
## scale 0.3080232 1.0000000
summary(fitg)
## Fitting of the distribution ' gamma ' by maximum likelihood 
## Parameters : 
##          estimate   Std. Error
## shape 1.863945273 0.1779486756
## rate  0.006014909 0.0006393986
## Loglikelihood:  -1108.917   AIC:  2221.833   BIC:  2228.069 
## Correlation matrix:
##           shape      rate
## shape 1.0000000 0.8554124
## rate  0.8554124 1.0000000
summary(fitln)
## Fitting of the distribution ' lnorm ' by maximum likelihood 
## Parameters : 
##          estimate Std. Error
## meanlog 5.4447213 0.06949096
## sdlog   0.8980212 0.04913726
## Loglikelihood:  -1128.268   AIC:  2260.537   BIC:  2266.773 
## Correlation matrix:
##         meanlog sdlog
## meanlog       1     0
## sdlog         0     1
cdfcomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal"))

denscomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal"))

qqcomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal"))

ppcomp(list(fitW, fitg, fitln), legendtext=c("Weibull", "gamma", "lognormal"))

test<-gofstat(list(fitW, fitg, fitln), fitnames=c("Weibull", "gamma", "lognormal"))
test
## Goodness-of-fit statistics
##                                 Weibull      gamma lognormal
## Kolmogorov-Smirnov statistic 0.07119671 0.08655184 0.1383817
## Cramer-von Mises statistic   0.12927550 0.16445221 0.6462149
## Anderson-Darling statistic   0.73945017 0.98792739 3.9059557
## 
## Goodness-of-fit criteria
##                                 Weibull    gamma lognormal
## Akaike's Information Criterion 2217.788 2221.833  2260.537
## Bayesian Information Criterion 2224.024 2228.069  2266.773
test$kstest
##        Weibull          gamma      lognormal 
## "not rejected" "not rejected"     "rejected"

The first graph provides a plot of the empirical distribution and each fitted distribution.
The second provides a density plot of each fitted distribution.
The third provides a plot of the quantiles of each theoretical distribution against the quantiles of the empirical distribution.
The last one provides a plot of the probabilities of each fitted distribution against the probabilities of the empirical distribution.

The goodness-of-fit assessment performs several tests and reports several criteria:

  • Test Statistics:

    • Kolmogorov-Smirnov (KS) Statistic: Measures the maximum distance between the empirical cumulative distribution function of your data and that of the theoretical distribution.

    • Cramer-von Mises Statistic: Similar to the KS test but gives more weight to the tails of the distribution.

    • Anderson-Darling Statistic: Sensitive to discrepancies in the tails of the distribution, giving more weight to extreme values.

  • Goodness-of-Fit Criteria:

    • Akaike’s Information Criterion (AIC): A criterion that considers both model fit and complexity. The goal is to minimize AIC.

    • Bayesian Information Criterion (BIC): Similar to AIC but penalizes more for complex models. Like AIC, lower BIC is preferred.

In summary, these measures help you choose the distribution that best describes your data. Based on the statistics above, the Weibull distribution appears to fit the data best.

4.2 Log-rank splitting

The default splitting rule used by the package is the log-rank test statistic and is specified by splitrule=“logrank” . The log-rank test has traditionally been used for two-sample testing with survival data, but it can be used for survival splitting as a means for maximizing between-node survival differences [2-6].

To explain log-rank splitting, consider a specific tree node to be split. Without loss of generality let us assume this is the root node (top of the tree). For simplicity assume the data is not bootstrapped, thus the root node data is \(\left(T_1, \mathbf{X}_1, \delta_1\right), \ldots,\left(T_n, \mathbf{X}_n, \delta_n\right)\). Let \(X\) denote a specific variable (i.e., one of the coordinates of the feature vector). A proposed split using \(X\) is of the form \(X \leq c\) and \(X>c\) (for simplicity we assume \(X\) is ordinal or numeric) and splits the node into left and right daughters, \(L=\left\{X_i \leq c\right\}\) and \(R=\left\{X_i>c\right\}\), respectively. Let \[ t_1<t_2<\cdots<t_m \] be the distinct death times and let \(d_{j, L}, d_{j, R}\) and \(Y_{j, L}, Y_{j, R}\) equal the number of deaths and individuals at risk at time \(t_j\) in daughter nodes \(L, R\). At risk means the number of individuals in a daughter who are alive at time \(t_j\), or who have an event (death) at time \(t_j\): \[ Y_{j, L}=\#\left\{T_i \geq t_j, X_i \leq c\right\}, \quad Y_{j, R}=\#\left\{T_i \geq t_j, X_i>c\right\} \]

Define \[ Y_j=Y_{j, L}+Y_{j, R}, \quad d_j=d_{j, L}+d_{j, R} \]

The log-rank test split-rule is based on a null hypothesis that :

  • Null hypothesis : \(H_{0}\) : \(\lambda_{L}(t)=\lambda_{R}(t)\) for all \(t\).

The log-rank split-statistic value for the split is \[ L(X, c)=\frac{\sum_{j=1}^m\left(d_{j, L}-Y_{j, L} \frac{d_j}{Y_j}\right)}{\sqrt{\sum_{j=1}^m \frac{Y_{j, L}}{Y_j}\left(1-\frac{Y_{j, L}}{Y_j}\right)\left(\frac{Y_j-d_j}{Y_j-1}\right) d_j}} . \]

The numerator computes the difference between the observed and expected numbers of deaths in the left daughter node. The denominator is an estimate of the standard error of the numerator.

Thus, the log-rank split-statistic is a standardized difference between the observed and expected number of deaths in the left daughter node under the null hypothesis.

The value \(|L(X, c)|\) is a measure of node separation. The larger the value, the greater the survival difference between \(L\) and \(R\), and the better the split is. The best split is determined by finding the feature \(X^*\) and split-value \(c^*\) such that \(\left|L\left(X^*, c^*\right)\right| \geq|L(X, c)|\) for all \(X\) and \(c\).
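
As an illustration (my own sketch, not the package's internal implementation), the statistic \(L(X, c)\) can be computed directly from this definition for a candidate variable and cut-point:

# Minimal sketch of the log-rank split statistic L(X, c) (illustration only).
# time: observed times; status: 1 = event, 0 = censored; x: candidate feature; c: cut-point.
logrank_split <- function(time, status, x, c) {
  left    <- x <= c
  t.death <- sort(unique(time[status == 1]))        # distinct death times t_1 < ... < t_m
  num <- 0
  den <- 0
  for (tj in t.death) {
    Y   <- sum(time >= tj)                          # at risk in the parent node
    Y.L <- sum(time >= tj & left)                   # at risk in the left daughter
    d   <- sum(time == tj & status == 1)            # deaths at t_j
    d.L <- sum(time == tj & status == 1 & left)     # deaths at t_j in the left daughter
    num <- num + (d.L - Y.L * d / Y)
    if (Y > 1) den <- den + (Y.L / Y) * (1 - Y.L / Y) * ((Y - d) / (Y - 1)) * d
  }
  num / sqrt(den)
}

# Example with a hypothetical cut-point on age (assuming death is coded status == 2,
# as in the survival::cancer data):
# logrank_split(cancer.train$time, as.numeric(cancer.train$status == 2),
#               cancer.train$age, 65)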