In this project I will predict the outcome death during an ICU stay.
This project will include the following sections:
In this section I will define the cohort I will use to create and evaluate my model.
First I need to connect to the necessary data tables in the Mimic3 Demo data
con <- DBI::dbConnect(drv = bigquery(),
project = "learnclinicaldatascience")
admissions <- tbl(con, 'mimic3_demo.ADMISSIONS') %>%
collect()
## Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## https://gargle.r-lib.org/articles/non-interactive-auth.html
## The bigrquery package is using a cached token for hayley.idance@gmail.com.
patients <- tbl(con, 'mimic3_demo.PATIENTS') %>%
mutate(dob_char = sql("CAST(DOB as STRING)")) %>%
collect()
icustays <- tbl(con, "mimic3_demo.ICUSTAYS") %>%
collect()
diagnoses_icd <- tbl(con, "mimic3_demo.DIAGNOSES_ICD") %>%
collect()
chartevents <- tbl(con, "mimic3_demo.CHARTEVENTS") %>%
collect()
d_items <- tbl(con, "mimic3_demo.D_ITEMS") %>%
collect()
d_icd_diagnoses <- tbl(con, "mimic3_demo.D_ICD_DIAGNOSES") %>%
collect()
We know that there are patients with multiple hospitalization and multiple ICU stays. Since we are looking for death, it would make sense that the stay of interest is the last ICU stay of the last hospitalization.
# create a list of the last hospitalization for each patient
last_hospitalization <- admissions %>%
group_by(SUBJECT_ID) %>%
filter(ADMITTIME == max(ADMITTIME)) %>%
ungroup() %>%
select(SUBJECT_ID, HADM_ID)
# double check we hae one per patient
last_hospitalization %>%
summarise(n_patients = n_distinct(SUBJECT_ID),
n_hosp = n_distinct(HADM_ID))
## # A tibble: 1 x 2
## n_patients n_hosp
## <int> <int>
## 1 100 100
Good we have now collected the last hospitalization for each patient. Now we want to isolate just the last ICU stay.
analytic_dataset <- icustays %>%
inner_join(last_hospitalization, by = c("SUBJECT_ID" = "SUBJECT_ID", "HADM_ID" = "HADM_ID")) %>%
group_by(SUBJECT_ID, HADM_ID) %>%
filter(INTIME == max(INTIME)) %>%
ungroup() %>%
select(SUBJECT_ID, HADM_ID, ICUSTAY_ID)
# double check that we have 1 hospital stay and 1 ICU stay per patient
analytic_dataset %>%
summarise(n_patients = n_distinct(SUBJECT_ID),
n_hosp = n_distinct(HADM_ID),
n_icu = n_distinct(ICUSTAY_ID))
## # A tibble: 1 x 3
## n_patients n_hosp n_icu
## <int> <int> <int>
## 1 100 100 100
Great, we have now defined our cohort.
Now I need to collect the variables that I will be using in my model. I will need the following:
I will be using the HOSPITAL_EXPIRE_FLAG within the admissions table. First I will collect this data, then I will add whether there was a death flag into my analytic dataset.
death_during_stay <- admissions %>%
select(SUBJECT_ID, HADM_ID, HOSPITAL_EXPIRE_FLAG)
# make sure joining the datasets will keep only 1 entry per patient
analytic_dataset %>%
left_join(death_during_stay) %>%
summarise(n_patients = n_distinct(SUBJECT_ID),
n_hosp = n_distinct(HADM_ID),
n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = c("SUBJECT_ID", "HADM_ID")
## # A tibble: 1 x 3
## n_patients n_hosp n_icu
## <int> <int> <int>
## 1 100 100 100
# looks safe, will join and then count deaths to see how many there are
analytic_dataset <- analytic_dataset %>%
left_join(death_during_stay)
## Joining, by = c("SUBJECT_ID", "HADM_ID")
analytic_dataset %>%
count(HOSPITAL_EXPIRE_FLAG)
## # A tibble: 2 x 2
## HOSPITAL_EXPIRE_FLAG n
## <int> <int>
## 1 0 60
## 2 1 40
So it looks like 40 of the 100 patients died during their last hospital stay.
I am choosing to use oxygen saturation as a predictor in my model. I think that it will be informative for patients who pass. I think it is most relevant to use the worst measurement in the last 12 hours of their stay as this will likely capture those who were declining near the end of their stay.
Using the code from the exercise in week 13, we can recall that the pulse ox item codes are 646 and 220277
# collect time of discharge (end of stay) to create the 12 hour window
icu_discharge_time <- icustays %>%
select(SUBJECT_ID, HADM_ID, ICUSTAY_ID, OUTTIME)
# want to make sure first that there aren't missing discharge times
summary(icu_discharge_time$OUTTIME)
## Min. 1st Qu. Median
## "2102-09-01 20:19:42" "2128-09-09 22:55:46" "2151-02-20 19:24:34"
## Mean 3rd Qu. Max.
## "2154-04-07 21:27:08" "2180-04-19 14:07:41" "2202-10-04 17:07:38"
# doesn't look like there are NAs
d_items %>%
filter(str_detect(LABEL, pattern = regex("SpO2|pulseox", ignore_case = TRUE)))
## # A tibble: 8 x 10
## ROW_ID ITEMID LABEL ABBREVIATION DBSOURCE LINKSTO CATEGORY UNITNAME PARAM_TYPE
## <int> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1923 5820 SpO2… <NA> carevue charte… <NA> <NA> <NA>
## 2 599 646 SpO2 <NA> carevue charte… <NA> <NA> <NA>
## 3 4343 6719 SpO2… <NA> carevue charte… <NA> <NA> <NA>
## 4 4998 8554 SpO2… <NA> carevue charte… <NA> <NA> <NA>
## 5 12766 223770 O2 S… SpO2 Alarm … metavis… charte… Alarms % Numeric
## 6 12765 223769 O2 S… SpO2 Alarm … metavis… charte… Alarms % Numeric
## 7 14442 226253 SpO2… SpO2 Desat … metavis… charte… Alarms % Numeric
## 8 12746 220277 O2 s… SpO2 metavis… charte… Respira… % Numeric
## # … with 1 more variable: CONCEPTID <int>
# gather the pulse ox information and filter to those taken in the last 12 hours of stay
pulse_ox <- chartevents %>%
filter(ITEMID %in% c(646, 220277)) %>%
select(SUBJECT_ID, HADM_ID, ICUSTAY_ID, CHARTTIME, VALUENUM) %>%
inner_join(icu_discharge_time) %>%
# determine start of 12 hour window
mutate(start_time = OUTTIME - 12*3600) %>%
filter(CHARTTIME >= start_time & CHARTTIME <= OUTTIME) %>%
# get the min value for each person
group_by(SUBJECT_ID, HADM_ID, ICUSTAY_ID) %>%
summarise(min_last12hr_pulseox = min(VALUENUM, na.rm = T))
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
## `summarise()` regrouping output by 'SUBJECT_ID', 'HADM_ID' (override with `.groups` argument)
Now we join this first clinical variable with our analytic dataset
# as always make sure that the join will go as expected first
analytic_dataset %>%
left_join(pulse_ox) %>%
summarise(n_patients = n_distinct(SUBJECT_ID),
n_hosp = n_distinct(HADM_ID),
n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
## # A tibble: 1 x 3
## n_patients n_hosp n_icu
## <int> <int> <int>
## 1 100 100 100
# Okay that looks fine, so we will add it
analytic_dataset <- analytic_dataset %>%
left_join(pulse_ox)
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
summary(analytic_dataset$min_last12hr_pulseox)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 83.75 93.00 80.71 95.00 100.00 8
Missing data: There are 8 people with missing data as NA
I will be looking for an ICD diagnosis of sepsis during the last hospital stay.
First I need to find the ICD diagnosis code(s) for sepsis
d_icd_diagnoses %>%
filter(str_detect(LONG_TITLE, pattern = regex("sepsis", ignore_case = TRUE)))
## # A tibble: 6 x 4
## ROW_ID ICD9_CODE SHORT_TITLE LONG_TITLE
## <int> <chr> <chr> <chr>
## 1 11403 99591 Sepsis Sepsis
## 2 11404 99592 Severe sepsis Severe sepsis
## 3 13564 67020 Puerperal sepsis-u… Puerperal sepsis, unspecified as to epis…
## 4 9050 77181 NB septicemia [sep… Septicemia [sepsis] of newborn
## 5 13566 67024 Puerperl sepsis-po… Puerperal sepsis, postpartum condition o…
## 6 13565 67022 Puerprl sepsis-del… Puerperal sepsis, delivered, with mentio…
Ok I will use Sepsis and Severe sepsis which are codes 99591 and 99592
sepsis <- diagnoses_icd %>%
filter(ICD9_CODE %in% c("99591", "99592")) %>%
distinct(SUBJECT_ID) %>%
mutate(sepsis = 1)
There are 23 diagnoses of sepsis. Now join it to the analytic dataset
# check that adding it is safe
analytic_dataset %>%
left_join(sepsis) %>%
mutate(sepsis = coalesce(sepsis, 0)) %>%
summarise(n_patients = n_distinct(SUBJECT_ID),
n_hosp = n_distinct(HADM_ID),
n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = "SUBJECT_ID"
## # A tibble: 1 x 3
## n_patients n_hosp n_icu
## <int> <int> <int>
## 1 100 100 100
# great its safe so go ahead and save it
analytic_dataset <- analytic_dataset %>%
left_join(sepsis) %>%
mutate(sepsis = coalesce(sepsis, 0))
## Joining, by = "SUBJECT_ID"
summary(analytic_dataset$sepsis)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.23 0.00 1.00
Missing data: There are no patients with missing data (NAs) in this variable. Those with no sepsis diagnosis are coded as a 0.
I am also interested in using age and sex as predictors, so I will extract those variables too.
Starting with sex, we will extract GENDER and code male = 1, female = 0.
male <- patients %>%
select(SUBJECT_ID, GENDER) %>%
mutate(male = case_when(GENDER == "M" ~ 1,
TRUE ~ 0))
analytic_dataset %>%
left_join(male) %>%
summarise(n_patients = n_distinct(SUBJECT_ID),
n_hosp = n_distinct(HADM_ID),
n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = "SUBJECT_ID"
## # A tibble: 1 x 3
## n_patients n_hosp n_icu
## <int> <int> <int>
## 1 100 100 100
# safe to join
analytic_dataset <- analytic_dataset %>%
left_join(male)
## Joining, by = "SUBJECT_ID"
Now for age. I will calculate age upon discharge. I already gather discharge time.
age_at_discharge <- patients %>%
select(SUBJECT_ID, DOB, dob_char) %>%
left_join(icu_discharge_time) %>%
mutate(age_at_discharge = round(as.numeric((OUTTIME - DOB)/365.25)))
## Joining, by = "SUBJECT_ID"
analytic_dataset %>%
left_join(age_at_discharge) %>%
summarise(n_patients = n_distinct(SUBJECT_ID),
n_hosp = n_distinct(HADM_ID),
n_icu = n_distinct(ICUSTAY_ID))
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
## # A tibble: 1 x 3
## n_patients n_hosp n_icu
## <int> <int> <int>
## 1 100 100 100
# safe to join
analytic_dataset <- analytic_dataset %>%
left_join(age_at_discharge)
## Joining, by = c("SUBJECT_ID", "HADM_ID", "ICUSTAY_ID")
summary(analytic_dataset$male)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.45 1.00 1.00
summary(analytic_dataset$age_at_discharge)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 17.00 64.00 76.00 70.22 83.00 89.00 8
Missing data: There are no patients with missing gender data. There are 8 NAs indicating that these patients were missing either DOB data or did not have a discharge time. Since I already checked that there are no patients missing discharge data, this means that there are 8 patients missing DOB.
To create and validate my model I will be splitting my data into training and testing (70/30). The training set will be used to build my model. The testing set will then be used to evaluate the model’s performance.
library(rsample)
set.seed(2020)
data_split <- initial_split(analytic_dataset, prop = 7/10)
training_data <- training(data_split)
testing_data <- testing(data_split)
training_data
## # A tibble: 70 x 12
## SUBJECT_ID HADM_ID ICUSTAY_ID HOSPITAL_EXPIRE… min_last12hr_pu… sepsis GENDER
## <int> <int> <int> <int> <dbl> <dbl> <chr>
## 1 10017 199207 204881 0 90 0 F
## 2 10074 170119 224021 0 96 0 M
## 3 10088 149044 277403 0 95 1 M
## 4 10027 199395 286020 0 90 0 F
## 5 10106 133283 217960 0 96 0 M
## 6 10043 168674 266122 0 91 0 M
## 7 10042 148562 258147 0 NA 0 M
## 8 10127 182839 271544 0 99 0 F
## 9 10064 111761 231809 1 0 0 M
## 10 10111 174739 263934 1 95 0 F
## # … with 60 more rows, and 5 more variables: male <dbl>, DOB <dttm>,
## # dob_char <chr>, OUTTIME <dttm>, age_at_discharge <dbl>
testing_data
## # A tibble: 30 x 12
## SUBJECT_ID HADM_ID ICUSTAY_ID HOSPITAL_EXPIRE… min_last12hr_pu… sepsis GENDER
## <int> <int> <int> <int> <dbl> <dbl> <chr>
## 1 10036 189483 227834 1 80 1 F
## 2 10040 157839 272047 0 89 0 F
## 3 10065 183314 245091 0 95 0 F
## 4 10130 156668 241562 0 98 0 M
## 5 10093 165393 251573 1 39 0 M
## 6 10011 105331 232110 1 94 0 F
## 7 10094 122928 273347 1 96 1 M
## 8 10033 157235 254543 0 94 0 F
## 9 10026 103770 277021 0 88 0 F
## 10 10102 164869 223870 1 90 0 M
## # … with 20 more rows, and 5 more variables: male <dbl>, DOB <dttm>,
## # dob_char <chr>, OUTTIME <dttm>, age_at_discharge <dbl>
I will build a logistic regression model using the training data and the glm() function in R. I will use HOSPITAL_EXPIRE_FLAG as my outcome and male, age_at_discharge, min_last12hr_pulseox, and sepsis as my predictors. I have selected this model because it is easy to interpret and because I have very few predictors, so there is no need to perform feature selection/reduction.
model <- training_data %>%
glm(formula = HOSPITAL_EXPIRE_FLAG ~ male + age_at_discharge + sepsis + min_last12hr_pulseox,
family = "binomial")
summary(model)
##
## Call:
## glm(formula = HOSPITAL_EXPIRE_FLAG ~ male + age_at_discharge +
## sepsis + min_last12hr_pulseox, family = "binomial", data = .)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.59388 -0.64134 -0.47456 0.00024 2.08779
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 18.725449 8.082914 2.317 0.0205 *
## male -0.618000 0.791606 -0.781 0.4350
## age_at_discharge -0.007238 0.024559 -0.295 0.7682
## sepsis -0.900937 1.147897 -0.785 0.4325
## min_last12hr_pulseox -0.205950 0.081182 -2.537 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79.763 on 60 degrees of freedom
## Residual deviance: 47.336 on 56 degrees of freedom
## (9 observations deleted due to missingness)
## AIC: 57.336
##
## Number of Fisher Scoring iterations: 8
Now I will use my model to predict the probability that the patients in my training data will die during their stay.
training_data$predicted_outcome <- predict(model, training_data, type = "response")
I will plot the ROC curve for the training data and calculate AUC
library(plotROC)
training_roc <- training_data %>%
ggplot(aes(m = predicted_outcome, d = HOSPITAL_EXPIRE_FLAG)) +
geom_roc(n.cuts = 10, labels = F, labelround = 4) +
style_roc(theme = theme_gray)
training_roc
calc_auc(training_roc)$AUC*100
## [1] 87.47086
Okay so an AUC of 90.6 is pretty good. I did intentionally select the last ICU stay from the last hospitalization so that helps. I also intentionally selected minimum (worst) pulse O2 as a predictor in the last 12 hours, because I thought that people of declining health would have lower oxygen saturation. I also thought sepsis was common in people who die in ICU, so I used that diagnosis code. This is in the training data, which was used to create the model, so this performance is better than what is expected of random or outside data.
I will be using holdout to validate my model. This was done by splitting my data into training and testing data. The model has not seen the testing data yet, so it can be used to test the performance of the model.
testing_data$predicted_outcome <- predict(model, testing_data, type = "response")
testing_roc <- testing_data %>%
ggplot(aes(m = predicted_outcome, d = HOSPITAL_EXPIRE_FLAG)) +
geom_roc(n.cuts = 10, labels = F, labelround = 4) +
style_roc(theme = theme_gray)
testing_roc
calc_auc(testing_roc)$AUC*100
## [1] 92.5
As expected, this model performed worse on the testing data than it did on the training data. An AUC of 79.4 is not terrible.
Overall, I think this model is pretty good. It would definitely need some more thoughtful feature selection in order to be clinical useful. It makes sense to use the last ICU stay of the last hospitalization for examining death, since patients who die could have prior stays but if they die, it would occur in their last visit. I think that my use of oxygen saturation during the last 12 hours was clinically relevant. However, my use of sepsis, while useful, is probably not good enough on its own. Someone with more clinical expertise may be abe to inform me of other diagnoses that often occur soon before or with death in an ICU.
I do see that 12 observations were dropped in my model due to missingness. One of the main limitations of basic logistic regression is that it cannot have missing data. We know that the missing data is in either DOB or min_last12hr_pulseox. In order for this model to be useful, it would likely need to be trained on a much larger sample.
training_data %>%
select(-predicted_outcome) %>%
filter_all(any_vars(is.na(.)))
## # A tibble: 9 x 12
## SUBJECT_ID HADM_ID ICUSTAY_ID HOSPITAL_EXPIRE… min_last12hr_pu… sepsis GENDER
## <int> <int> <int> <int> <dbl> <dbl> <chr>
## 1 10042 148562 258147 0 NA 0 M
## 2 10019 177759 228977 1 NA 1 M
## 3 10067 160442 236674 1 NA 0 M
## 4 10059 122098 248755 1 NA 1 M
## 5 10120 193924 268282 1 NA 1 M
## 6 10076 198503 201006 1 NA 1 M
## 7 44154 174245 217724 1 66 1 M
## 8 40655 126002 220016 0 92 1 F
## 9 42281 195911 221684 1 NA 0 F
## # … with 5 more variables: male <dbl>, DOB <dttm>, dob_char <chr>,
## # OUTTIME <dttm>, age_at_discharge <dbl>
8 of the removed individuals were missing the DOB and therefore the age predictor. This could be due to being too sick to provide age upon admission. If this was the case we could have missed the most sick patients (and therefore potentially most likely to have died). 6 of the patients removed did die during their last stay.
Based on how I built my model with a simple logistic regression and my selection of features, I would say that I built an operational model. Since I used the last ICU stay of the last hospitalization and I used data from the last 12 hours of the stay, by the time the results of this model were calculated, the patient would have already left the hospital or ICU or potentially died. Therefore this cannot be a clinical model. This model could be useful for informing trends about diagnoses and oxygen saturation levels leading to death, which could inform decisions about staffing or bed booking in a hospital. I think the user could likely be clinic administrators who decide on staffing.
I think this is overall a pretty simple model. However, it does require some algorithm training (it is not a simple presence/absence of a flag), so it would need to be run externally. It would require access to patient data in the EHR. It it very quick to run, due to its limited predictor set, so it could likely be run very frequently if desired.