Predicting death during ICU stay

In this project I will predict the outcome death during an ICU stay.

This project will include the following sections:

Cohort Definition

In this section I will define the cohort I will use to create and evaluate my model.

First I need to connect to the necessary data tables in the Mimic3 Demo data

con <- DBI::dbConnect(drv = bigquery(),
                      project = "learnclinicaldatascience")

admissions <- tbl(con, 'mimic3_demo.ADMISSIONS') %>% 
  collect()
## Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## https://gargle.r-lib.org/articles/non-interactive-auth.html
## The bigrquery package is using a cached token for hayley.idance@gmail.com.
patients <- tbl(con, 'mimic3_demo.PATIENTS') %>% 
  mutate(dob_char = sql("CAST(DOB as STRING)")) %>% 
  collect()
icustays <- tbl(con, "mimic3_demo.ICUSTAYS") %>% 
  collect()
diagnoses_icd <- tbl(con, "mimic3_demo.DIAGNOSES_ICD") %>% 
  collect()
chartevents <- tbl(con, "mimic3_demo.CHARTEVENTS") %>% 
  collect()
d_items <- tbl(con, "mimic3_demo.D_ITEMS") %>% 
  collect()
d_icd_diagnoses <- tbl(con, "mimic3_demo.D_ICD_DIAGNOSES") %>%
  collect()

We know that there are patients with multiple hospitalization and multiple ICU stays. Since we are looking for death, it would make sense that the stay of interest is the last ICU stay of the last hospitalization.

# create a list of the last hospitalization for each patient
last_hospitalization <- admissions %>%
  group_by(SUBJECT_ID) %>%
  filter(ADMITTIME == max(ADMITTIME)) %>%
  ungroup() %>%
  select(SUBJECT_ID, HADM_ID)

# double check we hae one per patient
last_hospitalization %>%
  summarise(n_patients = n_distinct(SUBJECT_ID),
            n_hosp = n_distinct(HADM_ID))
## # A tibble: 1 x 2
##   n_patients n_hosp
##        <int>  <int>
## 1        100    100

Good we have now collected the last hospitalization for each patient. Now we want to isolate just the last ICU stay.

analytic_dataset <- icustays %>%
  inner_join(last_hospitalization, by = c("SUBJECT_ID" = "SUBJECT_ID", "HADM_ID" = "HADM_ID")) %>%
  group_by(SUBJECT_ID, HADM_ID) %>%
  filter(INTIME == max(INTIME)) %>%
  ungroup() %>%
  select(SUBJECT_ID, HADM_ID, ICUSTAY_ID)

# double check that we have 1 hospital stay and 1 ICU stay per patient
analytic_dataset %>%
  summarise(n_patients = n_distinct(SUBJECT_ID),
            n_hosp = n_distinct(HADM_ID),
            n_icu = n_distinct(ICUSTAY_ID))
## # A tibble: 1 x 3
##   n_patients n_hosp n_icu
##        <int>  <int> <int>
## 1        100    100   100

Great, we have now defined our cohort.

Return to top

Feature Selection

Now I need to collect the variables that I will be using in my model. I will need the following:

  • Outcome variable (death during ICU stay)
  • Clinical variable #1 (lowest blood oxygenation in last 12 hours of stay)
  • Clinical variable #2 (Diagnosis of sepsis during stay)
  • Other demographic data (age, gender)

Outcome Variable: Death during ICU stay

I will be using the HOSPITAL_EXPIRE_FLAG within the admissions table. First I will collect this data, then I will add whether there was a death flag into my analytic dataset.

death_during_stay <- admissions %>%
  select(SUBJECT_ID, HADM_ID, HOSPITAL_EXPIRE_FLAG)

# make sure joining the datasets will keep only 1 entry per patient
analytic_dataset %>%
  left_join(death_during_stay) %>%
  summarise(n_patients = n_distinct(SUBJECT_ID),
            n_hosp = n_distinct(HADM_ID),
            n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = c("SUBJECT_ID", "HADM_ID")
## # A tibble: 1 x 3
##   n_patients n_hosp n_icu
##        <int>  <int> <int>
## 1        100    100   100
# looks safe, will join and then count deaths to see how many there are

analytic_dataset <- analytic_dataset %>%
  left_join(death_during_stay)
## Joining, by = c("SUBJECT_ID", "HADM_ID")
analytic_dataset %>%
  count(HOSPITAL_EXPIRE_FLAG)
## # A tibble: 2 x 2
##   HOSPITAL_EXPIRE_FLAG     n
##                  <int> <int>
## 1                    0    60
## 2                    1    40

So it looks like 40 of the 100 patients died during their last hospital stay.

Clinical variable #1: Lowest blood oxygenation in last 12 hours of stay

I am choosing to use oxygen saturation as a predictor in my model. I think that it will be informative for patients who pass. I think it is most relevant to use the worst measurement in the last 12 hours of their stay as this will likely capture those who were declining near the end of their stay.

Using the code from the exercise in week 13, we can recall that the pulse ox item codes are 646 and 220277

# collect time of discharge (end of stay) to create the 12 hour window
icu_discharge_time <- icustays %>%
  select(SUBJECT_ID, HADM_ID, ICUSTAY_ID, OUTTIME)

# want to make sure first that there aren't missing discharge times
summary(icu_discharge_time$OUTTIME)
##                  Min.               1st Qu.                Median 
## "2102-09-01 20:19:42" "2128-09-09 22:55:46" "2151-02-20 19:24:34" 
##                  Mean               3rd Qu.                  Max. 
## "2154-04-07 21:27:08" "2180-04-19 14:07:41" "2202-10-04 17:07:38"
# doesn't look like there are NAs

d_items %>% 
  filter(str_detect(LABEL, pattern = regex("SpO2|pulseox", ignore_case = TRUE)))
## # A tibble: 8 x 10
##   ROW_ID ITEMID LABEL ABBREVIATION DBSOURCE LINKSTO CATEGORY UNITNAME PARAM_TYPE
##    <int>  <int> <chr> <chr>        <chr>    <chr>   <chr>    <chr>    <chr>     
## 1   1923   5820 SpO2… <NA>         carevue  charte… <NA>     <NA>     <NA>      
## 2    599    646 SpO2  <NA>         carevue  charte… <NA>     <NA>     <NA>      
## 3   4343   6719 SpO2… <NA>         carevue  charte… <NA>     <NA>     <NA>      
## 4   4998   8554 SpO2… <NA>         carevue  charte… <NA>     <NA>     <NA>      
## 5  12766 223770 O2 S… SpO2 Alarm … metavis… charte… Alarms   %        Numeric   
## 6  12765 223769 O2 S… SpO2 Alarm … metavis… charte… Alarms   %        Numeric   
## 7  14442 226253 SpO2… SpO2 Desat … metavis… charte… Alarms   %        Numeric   
## 8  12746 220277 O2 s… SpO2         metavis… charte… Respira… %        Numeric   
## # … with 1 more variable: CONCEPTID <int>
# gather the pulse ox information and filter to those taken in the last 12 hours of stay
pulse_ox <- chartevents %>%
  filter(ITEMID %in% c(646, 220277)) %>%
  select(SUBJECT_ID, HADM_ID, ICUSTAY_ID, CHARTTIME, VALUENUM) %>%
  inner_join(icu_discharge_time) %>%
  # determine start of 12 hour window
  mutate(start_time = OUTTIME - 12*3600) %>%
  filter(CHARTTIME >= start_time & CHARTTIME <= OUTTIME) %>%
  # get the min value for each person 
  group_by(SUBJECT_ID, HADM_ID, ICUSTAY_ID) %>%
  summarise(min_last12hr_pulseox = min(VALUENUM, na.rm = T))
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
## `summarise()` regrouping output by 'SUBJECT_ID', 'HADM_ID' (override with `.groups` argument)

Now we join this first clinical variable with our analytic dataset

# as always make sure that the join will go as expected first
analytic_dataset %>%
  left_join(pulse_ox) %>%
  summarise(n_patients = n_distinct(SUBJECT_ID),
            n_hosp = n_distinct(HADM_ID),
            n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
## # A tibble: 1 x 3
##   n_patients n_hosp n_icu
##        <int>  <int> <int>
## 1        100    100   100
# Okay that looks fine, so we will add it

analytic_dataset <- analytic_dataset %>%
  left_join(pulse_ox)
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
summary(analytic_dataset$min_last12hr_pulseox)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   83.75   93.00   80.71   95.00  100.00       8

Missing data: There are 8 people with missing data as NA

Clinical variable #2: Diagnosis of sepsis during stay

I will be looking for an ICD diagnosis of sepsis during the last hospital stay.

First I need to find the ICD diagnosis code(s) for sepsis

d_icd_diagnoses %>% 
  filter(str_detect(LONG_TITLE, pattern = regex("sepsis", ignore_case = TRUE)))
## # A tibble: 6 x 4
##   ROW_ID ICD9_CODE SHORT_TITLE         LONG_TITLE                               
##    <int> <chr>     <chr>               <chr>                                    
## 1  11403 99591     Sepsis              Sepsis                                   
## 2  11404 99592     Severe sepsis       Severe sepsis                            
## 3  13564 67020     Puerperal sepsis-u… Puerperal sepsis, unspecified as to epis…
## 4   9050 77181     NB septicemia [sep… Septicemia [sepsis] of newborn           
## 5  13566 67024     Puerperl sepsis-po… Puerperal sepsis, postpartum condition o…
## 6  13565 67022     Puerprl sepsis-del… Puerperal sepsis, delivered, with mentio…

Ok I will use Sepsis and Severe sepsis which are codes 99591 and 99592

sepsis <- diagnoses_icd %>%
  filter(ICD9_CODE %in% c("99591", "99592")) %>%
  distinct(SUBJECT_ID) %>%
  mutate(sepsis = 1)

There are 23 diagnoses of sepsis. Now join it to the analytic dataset

# check that adding it is safe
analytic_dataset %>% 
  left_join(sepsis) %>%
  mutate(sepsis = coalesce(sepsis, 0)) %>%
  summarise(n_patients = n_distinct(SUBJECT_ID),
            n_hosp = n_distinct(HADM_ID),
            n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = "SUBJECT_ID"
## # A tibble: 1 x 3
##   n_patients n_hosp n_icu
##        <int>  <int> <int>
## 1        100    100   100
# great its safe so go ahead and save it
analytic_dataset <- analytic_dataset %>% 
  left_join(sepsis) %>%
  mutate(sepsis = coalesce(sepsis, 0))
## Joining, by = "SUBJECT_ID"
summary(analytic_dataset$sepsis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.23    0.00    1.00

Missing data: There are no patients with missing data (NAs) in this variable. Those with no sepsis diagnosis are coded as a 0.

Other variables of interest (age and sex)

I am also interested in using age and sex as predictors, so I will extract those variables too.

Starting with sex, we will extract GENDER and code male = 1, female = 0.

male <- patients %>%
  select(SUBJECT_ID, GENDER) %>%
  mutate(male = case_when(GENDER == "M" ~ 1,
                          TRUE ~ 0))

analytic_dataset %>% 
  left_join(male) %>%
  summarise(n_patients = n_distinct(SUBJECT_ID),
            n_hosp = n_distinct(HADM_ID),
            n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = "SUBJECT_ID"
## # A tibble: 1 x 3
##   n_patients n_hosp n_icu
##        <int>  <int> <int>
## 1        100    100   100
# safe to join

analytic_dataset <- analytic_dataset %>% 
  left_join(male)
## Joining, by = "SUBJECT_ID"

Now for age. I will calculate age upon discharge. I already gather discharge time.

age_at_discharge <- patients %>%
  select(SUBJECT_ID, DOB, dob_char) %>%
  left_join(icu_discharge_time) %>%
  mutate(age_at_discharge = round(as.numeric((OUTTIME - DOB)/365.25)))
## Joining, by = "SUBJECT_ID"
analytic_dataset %>%
  left_join(age_at_discharge) %>%
  summarise(n_patients = n_distinct(SUBJECT_ID),
            n_hosp = n_distinct(HADM_ID),
            n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
## # A tibble: 1 x 3
##   n_patients n_hosp n_icu
##        <int>  <int> <int>
## 1        100    100   100
# safe to join

analytic_dataset <- analytic_dataset %>%
  left_join(age_at_discharge)
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
summary(analytic_dataset$male)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.45    1.00    1.00
summary(analytic_dataset$age_at_discharge)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   17.00   64.00   76.00   70.22   83.00   89.00       8

Missing data: There are no patients with missing gender data. There are 8 NAs indicating that these patients were missing either DOB data or did not have a discharge time. Since I already checked that there are no patients missing discharge data, this means that there are 8 patients missing DOB.

Return to top

Model

To create and validate my model I will be splitting my data into training and testing (70/30). The training set will be used to build my model. The testing set will then be used to evaluate the model’s performance.

library(rsample)

set.seed(2020)

data_split <- initial_split(analytic_dataset, prop = 7/10)

training_data <- training(data_split)
testing_data <- testing(data_split)

training_data
## # A tibble: 70 x 12
##    SUBJECT_ID HADM_ID ICUSTAY_ID HOSPITAL_EXPIRE… min_last12hr_pu… sepsis GENDER
##         <int>   <int>      <int>            <int>            <dbl>  <dbl> <chr> 
##  1      10017  199207     204881                0               90      0 F     
##  2      10074  170119     224021                0               96      0 M     
##  3      10088  149044     277403                0               95      1 M     
##  4      10027  199395     286020                0               90      0 F     
##  5      10106  133283     217960                0               96      0 M     
##  6      10043  168674     266122                0               91      0 M     
##  7      10042  148562     258147                0               NA      0 M     
##  8      10127  182839     271544                0               99      0 F     
##  9      10064  111761     231809                1                0      0 M     
## 10      10111  174739     263934                1               95      0 F     
## # … with 60 more rows, and 5 more variables: male <dbl>, DOB <dttm>,
## #   dob_char <chr>, OUTTIME <dttm>, age_at_discharge <dbl>
testing_data
## # A tibble: 30 x 12
##    SUBJECT_ID HADM_ID ICUSTAY_ID HOSPITAL_EXPIRE… min_last12hr_pu… sepsis GENDER
##         <int>   <int>      <int>            <int>            <dbl>  <dbl> <chr> 
##  1      10036  189483     227834                1               80      1 F     
##  2      10040  157839     272047                0               89      0 F     
##  3      10065  183314     245091                0               95      0 F     
##  4      10130  156668     241562                0               98      0 M     
##  5      10093  165393     251573                1               39      0 M     
##  6      10011  105331     232110                1               94      0 F     
##  7      10094  122928     273347                1               96      1 M     
##  8      10033  157235     254543                0               94      0 F     
##  9      10026  103770     277021                0               88      0 F     
## 10      10102  164869     223870                1               90      0 M     
## # … with 20 more rows, and 5 more variables: male <dbl>, DOB <dttm>,
## #   dob_char <chr>, OUTTIME <dttm>, age_at_discharge <dbl>

Predictive model: Logistic regression

I will build a logistic regression model using the training data and the glm() function in R. I will use HOSPITAL_EXPIRE_FLAG as my outcome and male, age_at_discharge, min_last12hr_pulseox, and sepsis as my predictors. I have selected this model because it is easy to interpret and because I have very few predictors, so there is no need to perform feature selection/reduction.

model <- training_data %>%
  glm(formula = HOSPITAL_EXPIRE_FLAG ~ male + age_at_discharge + sepsis + min_last12hr_pulseox,
      family = "binomial")

summary(model)
## 
## Call:
## glm(formula = HOSPITAL_EXPIRE_FLAG ~ male + age_at_discharge + 
##     sepsis + min_last12hr_pulseox, family = "binomial", data = .)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.59388  -0.64134  -0.47456   0.00024   2.08779  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          18.725449   8.082914   2.317   0.0205 *
## male                 -0.618000   0.791606  -0.781   0.4350  
## age_at_discharge     -0.007238   0.024559  -0.295   0.7682  
## sepsis               -0.900937   1.147897  -0.785   0.4325  
## min_last12hr_pulseox -0.205950   0.081182  -2.537   0.0112 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 79.763  on 60  degrees of freedom
## Residual deviance: 47.336  on 56  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 57.336
## 
## Number of Fisher Scoring iterations: 8

Now I will use my model to predict the probability that the patients in my training data will die during their stay.

training_data$predicted_outcome <- predict(model, training_data, type = "response")

I will plot the ROC curve for the training data and calculate AUC

library(plotROC)

training_roc <- training_data %>%
  ggplot(aes(m = predicted_outcome, d = HOSPITAL_EXPIRE_FLAG)) +
  geom_roc(n.cuts = 10, labels = F, labelround = 4) +
  style_roc(theme = theme_gray)

training_roc

calc_auc(training_roc)$AUC*100
## [1] 87.47086

Okay so an AUC of 90.6 is pretty good. I did intentionally select the last ICU stay from the last hospitalization so that helps. I also intentionally selected minimum (worst) pulse O2 as a predictor in the last 12 hours, because I thought that people of declining health would have lower oxygen saturation. I also thought sepsis was common in people who die in ICU, so I used that diagnosis code. This is in the training data, which was used to create the model, so this performance is better than what is expected of random or outside data.

Return to top

Validation

I will be using holdout to validate my model. This was done by splitting my data into training and testing data. The model has not seen the testing data yet, so it can be used to test the performance of the model.

testing_data$predicted_outcome <- predict(model, testing_data, type = "response")

testing_roc <- testing_data %>%
  ggplot(aes(m = predicted_outcome, d = HOSPITAL_EXPIRE_FLAG)) +
  geom_roc(n.cuts = 10, labels = F, labelround = 4) +
  style_roc(theme = theme_gray)

testing_roc

calc_auc(testing_roc)$AUC*100
## [1] 92.5

As expected, this model performed worse on the testing data than it did on the training data. An AUC of 79.4 is not terrible.

Return to top

Usability

Overall, I think this model is pretty good. It would definitely need some more thoughtful feature selection in order to be clinical useful. It makes sense to use the last ICU stay of the last hospitalization for examining death, since patients who die could have prior stays but if they die, it would occur in their last visit. I think that my use of oxygen saturation during the last 12 hours was clinically relevant. However, my use of sepsis, while useful, is probably not good enough on its own. Someone with more clinical expertise may be abe to inform me of other diagnoses that often occur soon before or with death in an ICU.

I do see that 12 observations were dropped in my model due to missingness. One of the main limitations of basic logistic regression is that it cannot have missing data. We know that the missing data is in either DOB or min_last12hr_pulseox. In order for this model to be useful, it would likely need to be trained on a much larger sample.

training_data %>%
  select(-predicted_outcome) %>%
  filter_all(any_vars(is.na(.)))
## # A tibble: 9 x 12
##   SUBJECT_ID HADM_ID ICUSTAY_ID HOSPITAL_EXPIRE… min_last12hr_pu… sepsis GENDER
##        <int>   <int>      <int>            <int>            <dbl>  <dbl> <chr> 
## 1      10042  148562     258147                0               NA      0 M     
## 2      10019  177759     228977                1               NA      1 M     
## 3      10067  160442     236674                1               NA      0 M     
## 4      10059  122098     248755                1               NA      1 M     
## 5      10120  193924     268282                1               NA      1 M     
## 6      10076  198503     201006                1               NA      1 M     
## 7      44154  174245     217724                1               66      1 M     
## 8      40655  126002     220016                0               92      1 F     
## 9      42281  195911     221684                1               NA      0 F     
## # … with 5 more variables: male <dbl>, DOB <dttm>, dob_char <chr>,
## #   OUTTIME <dttm>, age_at_discharge <dbl>

8 of the removed individuals were missing the DOB and therefore the age predictor. This could be due to being too sick to provide age upon admission. If this was the case we could have missed the most sick patients (and therefore potentially most likely to have died). 6 of the patients removed did die during their last stay.

Algorithm interpretation

Based on how I built my model with a simple logistic regression and my selection of features, I would say that I built an operational model. Since I used the last ICU stay of the last hospitalization and I used data from the last 12 hours of the stay, by the time the results of this model were calculated, the patient would have already left the hospital or ICU or potentially died. Therefore this cannot be a clinical model. This model could be useful for informing trends about diagnoses and oxygen saturation levels leading to death, which could inform decisions about staffing or bed booking in a hospital. I think the user could likely be clinic administrators who decide on staffing.

Algorithm implementation

I think this is overall a pretty simple model. However, it does require some algorithm training (it is not a simple presence/absence of a flag), so it would need to be run externally. It would require access to patient data in the EHR. It it very quick to run, due to its limited predictor set, so it could likely be run very frequently if desired.

Return to top