Setup

I first need to load the required packages and data.

library(tidyverse)
library(magrittr)
library(bigrquery)

con <- DBI::dbConnect(drv = bigquery(),
                      project = "learnclinicaldatascience")
diabetes_notes <- tbl(con, "course4_data.diabetes_notes") %>% 
  collect()

goldstandard <- tbl(con, "course4_data.diabetes_goldstandard") %>%
  collect()

I need to describe my process using the following sections:

Approach
Regular Expression (s)
Performance
Reflection

Note exploration

I will first explore the structure and contents of one or more of the notes to try to decide on a good approach.

c(diabetes_notes[1,])

## $NOTE_ID
## [1] 1
## 
## $NOTE_TYPE
## [1] "History and Physical"
## 
## $TEXT
## [1] "CHIEF COMPLAINT:  Dog bite to his right lower leg.\n\nHISTORY OF PRESENT ILLNESS:  This 50-year-old white male earlier this afternoon was attempting to adjust a cable that a dog was tied to.  Dog was a German shepherd, it belonged to his brother, and the dog spontaneously attacked him.  He sustained a bite to his right lower leg.  Apparently, according to the patient, the dog is well known and is up-to-date on his shots and they wanted to confirm that.  The dog has given no prior history of any reason to believe he is not a healthy dog.  The patient himself developed a puncture wound with a flap injury.  The patient has a flap wound also below the puncture wound, a V-shaped flap, which is pointing towards the foot.  It appears to be viable.  The wound is open about may be roughly a centimeter in the inside of the flap.  He was seen by his medical primary care physician and was given a tetanus shot and the wound was cleaned and wrapped, and then he was referred to us for further assessment.\n\nPAST MEDICAL HISTORY (PMH):  Significant for history of pulmonary fibrosis and atrial fibrillation.  He is status post bilateral lung transplant back in 2004 because of the pulmonary fibrosis.\n\nALLERGIES:  There are no known allergies.\n\nMEDS:  Include multiple medications that are significant for his lung transplant including Prograf, CellCept, prednisone, omeprazole, Bactrim which he is on chronically, folic acid, vitamin D, Mag-Ox, Toprol-XL, calcium 500 mg, vitamin B1, Centrum Silver, verapamil, and digoxin.\n\nFAMILY HISTORY  Consistent with a sister of his has ovarian cancer and his father had liver cancer.  Heart disease in the patient's mother and father, and father also has diabetes and diabetic retinopathy.\n\nSOCIAL HISTORY:  He is a non-cigarette smoker.  He has occasional glass of wine.  He is married.  He has one biological child and three stepchildren.  He works for ABCD.\n\nROS:  He denies any chest pain.  
He does admit to exertional shortness of breath.  He denies any GI or GU problems.  He denies any bleeding disorders.\n\nPHYSICAL EXAMINATIONGENERAL:  Presents as a well-developed, well-nourished 50-year-old white male who appears to be in mild distress.\n\nHEENT:  Unremarkable.\n\nNECK:  Supple.  There is no mass, adenopathy or bruit.\n\nCHEST:  Normal excursion.\n\nLUNGS:  Clear to auscultation and percussion.\n\nCOR:  Regular.  There is no S3 or S4 gallop.  There is no obvious murmur.\n\nABDOMEN:  Soft.  It is nontender.  Bowel sounds are present.  There is no tenderness.\n\nSKIN:  He does have like a Chevron incisional scar across his lower chest and upper abdomen.  It appears to be well healed and unremarkable.\n\nGENITALIA:  Deferred.\n\nRECTAL:  Deferred.\n\nEXTREMITIES:  He has about 1+ pitting edema to both legs and they have been present since the surgery.  In the right leg, he has an about midway between the right knee and right ankle on the anterior pretibial area, he has a puncture wound that measures about may be centimeter around that appears to be relatively clean, and just below that about may be 3 cm below, he has a flap traumatic injury that measures about may be 4 cm to the point of the flap.  The wound is spread apart about may be a centimeter all along that area and it is relatively clean.  There was some bleeding when I removed the dressing and we were able to pretty much control that with pressure and some silver nitrate.  There were exposed subcutaneous tissues, but there was no exposed tendons that we could see, etc.  The flap appeared to be viable.\n\nNEUROLOGIC:  Without focal deficits.  The patient is alert and oriented.\n\nIMPRESSION:  A 50-year-old white male with dog bite to his right leg with a history of pulmonary fibrosis, status post bilateral lung transplant several years ago.  He is on multiple medications and he is on chronic Bactrim.  
We are going to also add some fluoroquinolone right now to protect the skin and probably going to obtain an Infectious Disease consult.  We will see him back in the office early next week to reassess his wound.  He is to keep the wound clean with the moist dressing right now.  He may shower several times a day."

Text processing

Approach: Keyword Window Technique

Each note has a NOTE_ID, NOTE_TYPE, and TEXT. Because some notes are broken up by body part rather than by a consistent set of sections, it would be difficult to search using note sections: I would have to apply a similar search strategy to many different sections, which could be cumbersome. I will instead use a keyword window, aiming to identify information in the words surrounding the keyword(s) of interest.

I will need to define the following:

The note type I am using
The keywords I will use
The window size I will use
The type of information I will be looking for in each window

diabetes_notes %>% nlp_example_datatable()

Note Type

I will search all note types for signs of diabetic complications, since history and physical notes, operative notes, and discharge summaries are all likely to contain relevant information.

Keywords to use

First, here are the helper functions that extract the windows matching 1 or 2 keywords.

extract_text_window <- function(dataframe, keyword, half_window_size) {
  dataframe %>% 
    group_by(NOTE_ID) %>% 
    mutate(WORDS = TEXT) %>% 
    separate_rows(WORDS, sep = "[ \n]+") %>% 
    mutate(INDEX = seq(from = 1, to = n(), by = 1.0),
           WINDOW_START = case_when(INDEX - half_window_size < 1 ~ 1,
                                    TRUE ~ INDEX - half_window_size),
           WINDOW_END = case_when(INDEX + half_window_size > max(INDEX) ~ max(INDEX),
                                  TRUE ~ INDEX + half_window_size),
           WINDOW = word(string = TEXT, start = WINDOW_START, end = WINDOW_END, sep = "[ \n]+")) %>% 
    ungroup() %>% 
    filter(str_detect(string = WORDS, pattern = regex(keyword, ignore_case = TRUE)))
}

extract_2words_text_window <- function(dataframe, keyword1, keyword2, half_window_size) {
  dataframe %>%
    group_by(NOTE_ID) %>%
    mutate(WORDS = TEXT) %>%
    separate_rows(WORDS, sep = "[ \n]+") %>%
    mutate(INDEX = seq(from = 1, to = n(), by = 1.0),
           WINDOW_START = case_when(INDEX - half_window_size < 1 ~ 1,
                                    TRUE ~ INDEX - half_window_size),
           WINDOW_END = case_when(INDEX + half_window_size > max(INDEX) ~ max(INDEX),
                                  TRUE ~ INDEX + half_window_size),
           WINDOW = word(string = TEXT, start = WINDOW_START, end = WINDOW_END, sep = "[ \n]+")) %>%
    ungroup() %>%
    # Keep rows where the current word matches keyword1 and the next word matches keyword2
    filter(str_detect(string = WORDS, pattern = regex(keyword1, ignore_case = TRUE)),
           str_detect(string = lead(WORDS), pattern = regex(keyword2, ignore_case = TRUE))) %>%
    # Shift the recorded end index by one for the second keyword (WINDOW itself is not recomputed)
    mutate(WINDOW_END = WINDOW_END + 1)
}
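
One edge case worth checking in the two-keyword helper: the filter runs after ungroup(), so lead(WORDS) on the last word of one note looks at the first word of the next note. The sketch below uses made-up words rather than the course data and shows one possible way to keep the look-ahead inside each note. The toy_words tibble and the is_pair() helper are hypothetical names introduced only for this illustration.

```r
library(dplyr)
library(stringr)

# Made-up words; note 1 ends with "renal" and note 2 starts with "failure"
toy_words <- tibble(
  NOTE_ID = c(1, 1, 2, 2, 2, 3, 3, 3),
  WORDS   = c("chronic", "renal",
              "failure", "to", "thrive",
              "acute", "renal", "failure")
)

# TRUE where a word matches "renal" and the following word matches "failure"
is_pair <- function(words) {
  str_detect(words, regex("renal", ignore_case = TRUE)) &
    str_detect(lead(words, default = ""), regex("failure", ignore_case = TRUE))
}

# Look-ahead across the whole table, as happens after ungroup()
ungrouped_hits <- toy_words %>% filter(is_pair(WORDS))

# Look-ahead restricted to each note
grouped_hits <- toy_words %>%
  group_by(NOTE_ID) %>%
  filter(is_pair(WORDS)) %>%
  ungroup()

ungrouped_hits$NOTE_ID
grouped_hits$NOTE_ID
```

In this toy case the ungrouped version reports a hit in note 1 that spans two notes, while the grouped version only keeps the genuine pair in note 3. Whether this ever happens in the real notes would need to be checked.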

I will try using the following words/phrases

neuropathy
nerve pain
nephropathy
kidney failure
retinopathy

Window size

I will likely need a medium-sized window around my keywords. I don’t necessarily expect the explanation of the complication to be directly next to the keyword, but it shouldn’t be too far away, so I will begin with a window size of 20 (half window size = 10).
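
To make the window size concrete, here is a minimal sketch on an invented sentence (not from the course data), using a half window of 3 instead of 10 so the result is easy to read. It mirrors the clamping idea in the helper functions but is not the helper itself.

```r
library(stringr)

# Toy sentence, invented for illustration
text <- "The patient has a long history of diabetes complicated by neuropathy in both feet and hands"

words <- str_split(text, "[ \n]+")[[1]]
half_window_size <- 3

# Position of the keyword among the words
hit <- which(str_detect(words, regex("neuropathy", ignore_case = TRUE)))

# Clamp the window to the bounds of the text
window_start <- max(hit - half_window_size, 1)
window_end   <- min(hit + half_window_size, length(words))

window <- paste(words[window_start:window_end], collapse = " ")
window
```

With a half window of 3 the window holds the keyword plus three words on each side, so anything further away (such as a negation earlier in the sentence) is not visible in it.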

Type of information I’m looking for

I will be looking for indication of current complications as a result of diabetes. This will include nerve damage like tingling or pain, kidney damage, and retinal blood vessel damage. I will make sure the information is referring to the patient about whom the note is written as well.

Regular Expression(s)

I will be looking for the following words:

neuropathy
nerve pain
nephropathy
kidney failure
retinopathy

I will work through my use of a regular expression to catch kidney failure.

Here is my regular expression:

(?<![a-zA-Z])(kidney|renal)( disease| failure)(?![a-zA-Z])

This expression uses a negative lookbehind and a negative lookahead to make sure my words are not immediately preceded or followed by other letters. For example, it prevents “adrenal” from matching renal. Between the two, it matches either kidney or renal (the medical term for kidney) followed by a space and either disease or failure.

This should catch the following:

kidney disease
kidney failure
renal disease
renal failure
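
A quick way to sanity-check the pattern is to run it on a few short phrases. The phrases below are made up for illustration, and the pattern is written with a negative lookbehind (?<!...) and a negative lookahead (?!...):

```r
library(stringr)

# Kidney failure pattern with negative lookbehind and lookahead
kidney_pattern <- regex("(?<![a-zA-Z])(kidney|renal)( disease| failure)(?![a-zA-Z])",
                        ignore_case = TRUE)

# Made-up phrases, not taken from the course notes
examples <- c("End-stage renal failure",
              "chronic kidney disease.",
              "adrenal failure",
              "renal failures",
              "Kidney stones")

kidney_hits <- str_detect(examples, kidney_pattern)
data.frame(examples, kidney_hits)
```

The first two phrases should match; "adrenal failure" is blocked by the lookbehind, "renal failures" by the lookahead, and "Kidney stones" has no second keyword.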

Now I will test how well it works:

diabetes_notes %>%
  mutate(KIDNEY_FAILURE = case_when(str_detect(string = TEXT, pattern = regex("(?<![a-zA-Z])(kidney|renal)( disease| failure)(?![a-zA-Z])", ignore_case = TRUE)) ~ 1,
                                    TRUE ~ 0)) %>%
  filter(KIDNEY_FAILURE == 1)

## # A tibble: 19 x 4
##    NOTE_ID NOTE_TYPE       TEXT                                   KIDNEY_FAILURE
##      <int> <chr>           <chr>                                           <dbl>
##  1      12 Operative Note  "PREOPERATIVE DIAGNOSES1.  End-stage …              1
##  2      13 Operative Note  "PREOPERATIVE DIAGNOSES1.  End-stage …              1
##  3      15 History and Ph… "CHIEF COMPLAINT: Right-sided weaknes…              1
##  4      21 Discharge Summ… "DIAGNOSIS:  Refractory anemia that i…              1
##  5      27 History and Ph… "CHIEF COMPLAINT:  Penile discharge, …              1
##  6      35 Discharge Summ… "REASON FOR CONSULTATION:  Syncope.\n…              1
##  7      41 History and Ph… "REASON FOR VISIT:  Acute kidney fail…              1
##  8      42 History and Ph… "HISTORY OF PRESENT ILLNESS:  The pat…              1
##  9      48 Discharge Summ… "CHIEF COMPLAINT: Headache and pain i…              1
## 10      55 Discharge Summ… "Chief Complaint: Abdominal pain, nau…              1
## 11      62 History and Ph… "HISTORY OF PRESENT ILLNESS:  This is…              1
## 12      69 Discharge Summ… "DIAGNOSES PROBLEMS:1.  Orthostatic h…              1
## 13      82 Discharge Summ… "Chief Complaint: Back and hip pain.H…              1
## 14     108 Discharge Summ… "ADMISSION DIAGNOSIS:  End-stage rena…              1
## 15     109 History and Ph… "REASON FOR CONSULTATION:  Abnormal c…              1
## 16     112 History and Ph… "REASON FOR CONSULTATION:  Renal fail…              1
## 17     120 History and Ph… "SUBJECTIVE:  The patient is in compl…              1
## 18     123 History and Ph… "REASON FOR CONSULTATION:  Management…              1
## 19     140 History and Ph… "HISTORY OF PRESENT ILLNESS:  This 66…              1

This regex hits 19 notes. Let’s look at the contents of some of the notes.

diabetes_notes %>%
  mutate(KIDNEY_FAILURE = case_when(str_detect(string = TEXT, pattern = regex("(?<![a-zA-Z])(kidney|renal)( disease| failure)(?![a-zA-Z])", ignore_case = TRUE)) ~ 1,
                                    TRUE ~ 0)) %>%
  filter(KIDNEY_FAILURE == 1) %>% nlp_example_datatable()

I do notice that one of the notes includes negation: “He denies any comorbid complications of the diabetes including kidney disease,…”. I will have to check to remove this window.
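
Negation like this could in principle be flagged automatically inside each window. A minimal sketch, assuming a small hand-picked cue list (hypothetical, not clinically validated). The first and third example windows are phrases quoted in this report and the second is made up:

```r
library(stringr)

# Hypothetical cue list; a real project would need a clinically reviewed one
negation_cues <- regex("\\b(denies|denied|no|not|negative for|without)\\b",
                       ignore_case = TRUE)

windows <- c("He denies any comorbid complications of the diabetes including kidney disease",
             "admitted with acute renal failure secondary to dehydration",
             "Negative for coronary heart disease, hypertension, diabetes, or kidney disease")

# TRUE where the window contains at least one negation cue
negated <- str_detect(windows, negation_cues)
data.frame(negated, windows)
```

A cue list like this only works when the negation falls inside the window, and it cannot tell which word the negation applies to, so hand review would still be needed.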

Now I will try the two-keyword window extraction.

diabetes_notes %>% extract_2words_text_window(keyword1 = "(?<![a-zA-Z])(kidney|renal)(?![a-zA-Z])", 
                                              keyword2 = "(?<![a-zA-Z])(disease|failure)(?![a-zA-Z])", 
                                              half_window_size = 10)

## # A tibble: 41 x 8
##    NOTE_ID NOTE_TYPE  TEXT       WORDS INDEX WINDOW_START WINDOW_END WINDOW     
##      <int> <chr>      <chr>      <chr> <dbl>        <dbl>      <dbl> <chr>      
##  1      12 Operative… "PREOPERA… renal     4            1         15 "PREOPERAT…
##  2      12 Operative… "PREOPERA… renal    12            2         23 "DIAGNOSES…
##  3      12 Operative… "PREOPERA… renal    27           17         38 "right bra…
##  4      13 Operative… "PREOPERA… renal     4            1         15 "PREOPERAT…
##  5      13 Operative… "PREOPERA… renal    20           10         31 "chronic a…
##  6      13 Operative… "PREOPERA… renal    61           51         72 "Michael C…
##  7      15 History a… "CHIEF CO… kidn…   250          240        261 "moderate …
##  8      21 Discharge… "DIAGNOSI… kidn…    40           30         51 "diabetes.…
##  9      27 History a… "CHIEF CO… renal   175          165        186 "as in the…
## 10      35 Discharge… "REASON F… kidn…   864          854        875 "this.4.  …
## # … with 31 more rows

diabetes_notes %>% extract_2words_text_window(keyword1 = "(?<![a-zA-Z])(kidney|renal)(?![a-zA-Z])", 
                                              keyword2 = "(?<![a-zA-Z])(disease|failure)(?![a-zA-Z])", 
                                              half_window_size = 10) %>%
  nlp_example_datatable()

Good, the extract function found the same 19 notes in 41 total windows.

Performance

Now it is time to apply the keyword window text identification to identify patients who have diabetic complications of neuropathy, nephropathy, and/or retinopathy.

To achieve this, I will extract the notes for each of the three complications separately and then merge them.

Break the notes into windows and extract those with matching word(s), then remove negation and wrong subject windows

Neuropathy

neuropathy <- diabetes_notes %>% extract_text_window(keyword = "(?<![a-zA-Z])neuropathy(?![a-zA-Z])", half_window_size = 10)

nerve_pain <- diabetes_notes %>% extract_2words_text_window(keyword1 = "(?<![a-zA-Z])(nerve)(?![a-zA-Z])", 
                                              keyword2 = "(?<![a-zA-Z])(pain)(?![a-zA-Z])", 
                                              half_window_size = 10)

neuropathy <- rbind(neuropathy, nerve_pain) 

neuropathy %>% nlp_example_datatable()

neuropathy %>%
  summarise(unique(NOTE_ID))

## # A tibble: 19 x 1
##    `unique(NOTE_ID)`
##                <int>
##  1                 3
##  2                 7
##  3                21
##  4                24
##  5                27
##  6                30
##  7                34
##  8                37
##  9                38
## 10                52
## 11                61
## 12                71
## 13                82
## 14                86
## 15                97
## 16               118
## 17               126
## 18               130
## 19                18

We get 19 unique notes and 29 relevant windows.

In note 21, the information that shows negation (denies comorbid complications) lies outside the window. I would either need to remove it by hand or increase the window size.

In note 38, the neuropathy is likely due to a condition that is not diabetes (thrombocythemia), so I will also need to exclude this one.

After looking through the rest of the notes, it is not worth increasing the window size, as no other note has important information right outside the window. I will just keep in mind that note 21 is incorrectly classified. For now I will deal only with the second case above and exclude the windows from the note that mentions thrombocythemia.

neuropathy_filtered <- neuropathy %>%
  mutate(EXCLUDE = case_when(str_detect(string = TEXT, pattern = regex("(?<![a-zA-Z])thrombocythemia(?![a-zA-Z])", ignore_case = TRUE)) ~ 1,
                             TRUE ~ 0)) %>%
  filter(EXCLUDE == 0)

neuropathy_filtered %>% 
  summarise(unique(NOTE_ID))

## # A tibble: 18 x 1
##    `unique(NOTE_ID)`
##                <int>
##  1                 3
##  2                 7
##  3                21
##  4                24
##  5                27
##  6                30
##  7                34
##  8                37
##  9                52
## 10                61
## 11                71
## 12                82
## 13                86
## 14                97
## 15               118
## 16               126
## 17               130
## 18                18

neuropathy_filtered %>% nlp_example_datatable()

Good, that filter properly removed note #38.
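
The exclusion above matches on TEXT, so every window from a note that mentions thrombocythemia anywhere is dropped. A possible variation is to match on WINDOW instead, which only drops the windows where the term appears near the keyword. The sketch below contrasts the two on invented rows. The toy_windows tibble is hypothetical and is not the real neuropathy table.

```r
library(dplyr)
library(stringr)

# Toy windows, invented for illustration; note 1 contributes two windows
toy_windows <- tibble(
  NOTE_ID = c(1, 1, 2),
  TEXT    = c("neuropathy from thrombocythemia ... later, diabetic neuropathy",
              "neuropathy from thrombocythemia ... later, diabetic neuropathy",
              "history of diabetic neuropathy"),
  WINDOW  = c("neuropathy from thrombocythemia",
              "later, diabetic neuropathy",
              "history of diabetic neuropathy")
)

excl <- regex("thrombocythemia", ignore_case = TRUE)

# Note-level exclusion: removes both windows of note 1
by_note <- toy_windows %>% filter(!str_detect(TEXT, excl))

# Window-level exclusion: removes only the window that contains the term
by_window <- toy_windows %>% filter(!str_detect(WINDOW, excl))

nrow(by_note)
nrow(by_window)
```

Which level is appropriate depends on the note: for note 38 the whole-note exclusion gave the intended result, but it could hide a genuine diabetic complication mentioned elsewhere in the same note.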

Nephropathy

nephropathy <- diabetes_notes %>% extract_text_window(keyword = "(?<![a-zA-Z])nephropathy(?![a-zA-Z])", half_window_size = 10)

kidney_failure <- diabetes_notes %>% extract_2words_text_window(keyword1 = "(?<![a-zA-Z])(kidney|renal)(?![a-zA-Z])", 
                                              keyword2 = "(?<![a-zA-Z])(disease|failure)(?![a-zA-Z])", 
                                              half_window_size = 10)

nephropathy <- rbind(nephropathy, kidney_failure)

nephropathy %>%
  arrange(NOTE_ID) %>%
  summarise(unique(NOTE_ID))

## # A tibble: 21 x 1
##    `unique(NOTE_ID)`
##                <int>
##  1                 6
##  2                12
##  3                13
##  4                15
##  5                21
##  6                27
##  7                35
##  8                41
##  9                42
## 10                48
## # … with 11 more rows

nephropathy %<>% arrange(NOTE_ID)

nephropathy %>% nlp_example_datatable()

This keyword window search yielded 21 unique notes in 51 total windows.

Again, note 21 needs to be excluded because of “denies any comorbid complications”.

In note 41, the doctor writes that he/she is concerned about the patient’s use of Chinese herbs, which can cause nephritis, and thinks it is “more likely that[than] diabetic nephropathy”. I should exclude this one also.

In note 48, acute renal failure was observed in the patient, but it was due to tumor lysis syndrome rather than diabetes. I will exclude this as well.

Note 82 includes negation: “Negative for coronary heart disease, hypertension, diabetes, or kidney disease”.

Note 109 is incorrectly classified: there is no history of diabetes.

I also need to exclude “mother and father were on dialysis”, which refers to family members rather than the patient.

nephropathy_filtered <- nephropathy %>% 
  mutate(EXCLUDE = case_when(str_detect(string = TEXT, pattern = regex("denies any comorbid complications", ignore_case = T)) ~ 1,
                             str_detect(string = TEXT, pattern = regex("more likely that diabetic nephropathy", ignore_case = T)) ~ 1,
                             str_detect(string = TEXT, pattern = regex("tumor lysis syndrome", ignore_case = T)) ~ 1,
                             str_detect(string = TEXT, pattern = regex("negative for coronary heart disease, hypertension, diabetes, or kidney disease", ignore_case = T)) ~ 1,
                             str_detect(string = TEXT, pattern = regex("mother and father were on dialysis", ignore_case = T)) ~ 1,
                             TRUE ~ 0)) %>%
  filter(EXCLUDE == 0)

nephropathy_filtered %>% 
  summarise(unique(NOTE_ID))

## # A tibble: 16 x 1
##    `unique(NOTE_ID)`
##                <int>
##  1                 6
##  2                12
##  3                13
##  4                15
##  5                27
##  6                35
##  7                42
##  8                51
##  9                55
## 10                62
## 11                69
## 12               108
## 13               109
## 14               120
## 15               123
## 16               140

nephropathy_filtered %>% nlp_example_datatable()

Good, the filter removed the 5 notes it should have, so now I have 16 notes (down from 21) and a total of 34 windows.

Retinopathy

retinopathy <- diabetes_notes %>% extract_text_window(keyword = "(?<![a-zA-Z])retinopathy(?![a-zA-Z])", half_window_size = 10)

retinopathy %>% 
  summarise(unique(NOTE_ID))

## # A tibble: 5 x 1
##   `unique(NOTE_ID)`
##               <int>
## 1                 1
## 2                21
## 3                86
## 4                94
## 5               136

retinopathy %>% nlp_example_datatable()

This keyword window search identified 5 notes in a total of 6 windows.

Note 1 (window 1) states “father also has diabetes and diabetic retinopathy”. I will need to remove this one.

In note 21 there is negation: “No retinopathy”.

Note 94 talks about family history, not the current patient: “strong family history of diabetes and including diabetic complications of retinopathy”.

Note 136 also has negation: “does not show any evidence of diabetic retinopathy at this time”

Now I will remove these exclusions.

retinopathy_filtered <- retinopathy %>%
  mutate(EXCLUDE = case_when(str_detect(string = TEXT, pattern = regex("father also has diabetes and diabetic retinopathy", ignore_case = T)) ~ 1,
                             str_detect(string = TEXT, pattern = regex("no retinopathy", ignore_case = T)) ~ 1,
                             str_detect(string = TEXT, pattern = regex("strong family history of diabetes including diabetic complications", ignore_case = T)) ~ 1,
                             str_detect(string = TEXT, pattern = regex("does not show any evidence of diabetic retinopathy", ignore_case = T)) ~ 1,
                             TRUE ~ 0)) %>%
  filter(EXCLUDE == 0)

retinopathy_filtered %>%
  summarise(unique(NOTE_ID))

## # A tibble: 1 x 1
##   `unique(NOTE_ID)`
##               <int>
## 1                86

retinopathy_filtered %>% nlp_example_datatable()

Great, we have removed all but 1, which was the goal, since the removed notes were either negated or referred to a different individual.

All

I now want to merge the data sets to obtain the notes that are identified as cases for any (or multiple) of the three diabetic complications.

all_complications <- rbind(neuropathy_filtered,
                           rbind(nephropathy_filtered, retinopathy_filtered))

all_complications %<>% 
  arrange(NOTE_ID)

all_complications %>%
  group_by(NOTE_ID)

## # A tibble: 60 x 9
## # Groups:   NOTE_ID [33]
##    NOTE_ID NOTE_TYPE  TEXT   WORDS INDEX WINDOW_START WINDOW_END WINDOW  EXCLUDE
##      <int> <chr>      <chr>  <chr> <dbl>        <dbl>      <dbl> <chr>     <dbl>
##  1       3 Discharge… "CC: … neur…   151          141        161 She al…       0
##  2       6 Operative… "PREO… neph…    48           38         58 The pa…       0
##  3       7 Operative… "S - … Neur…   224          214        234 Planta…       0
##  4      12 Operative… "PREO… Neph…     6            1         16 PREOPE…       0
##  5      12 Operative… "PREO… renal     4            1         15 PREOPE…       0
##  6      12 Operative… "PREO… renal    12            2         23 DIAGNO…       0
##  7      12 Operative… "PREO… renal    27           17         38 right …       0
##  8      13 Operative… "PREO… renal     4            1         15 PREOPE…       0
##  9      13 Operative… "PREO… renal    20           10         31 chroni…       0
## 10      13 Operative… "PREO… renal    61           51         72 Michae…       0
## # … with 50 more rows

all_complications %>% nlp_example_datatable()

Summary of performance

In order to compare my results to the gold standard, I need to record which condition was found in each note and whether any condition was found.

checks <- data.frame(NOTE_ID = integer(),
                     ANY_DIABETIC_COMPLICATION = integer(),
                     DIABETIC_NEUROPATHY = integer(),
                     DIABETIC_NEPHROPATHY = integer(),
                     DIABETIC_RETINOPATHY = integer())

for (i in 1:nrow(diabetes_notes)) {
  id <- diabetes_notes$NOTE_ID[i]
  neuro <- ifelse((id %in% neuropathy_filtered$NOTE_ID), 1, 0)
  nephro <- ifelse((id %in% nephropathy_filtered$NOTE_ID), 1, 0)
  retino <- ifelse((id %in% retinopathy_filtered$NOTE_ID), 1, 0)
  any <- ifelse((neuro == 1 | nephro == 1 | retino == 1), 1, 0)
  toAdd <- c(NOTE_ID = id, 
             ANY_DIABETIC_COMPLICATION = any, 
             DIABETIC_NEUROPATHY = neuro, 
             DIABETIC_NEPHROPATHY = nephro, 
             DIABETIC_RETINOPATHY = retino)
  checks <- rbind(checks, toAdd)
}

colnames(checks) <- c("NOTE_ID", "ANY_DIABETIC_COMPLICATION", "DIABETIC_NEUROPATHY", "DIABETIC_NEPHROPATHY", "DIABETIC_RETINOPATHY")
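
The same indicator table can also be built without a loop. A minimal sketch on toy stand-ins (the toy_* objects are hypothetical, not the real filtered tables), using %in% for membership and pmax() for the any-complication flag:

```r
library(dplyr)

# Toy stand-ins for diabetes_notes and the three filtered window tables
toy_notes  <- tibble(NOTE_ID = 1:5)
toy_neuro  <- tibble(NOTE_ID = c(2L, 2L, 4L))   # a note can have several windows
toy_nephro <- tibble(NOTE_ID = 4L)
toy_retino <- tibble(NOTE_ID = integer())       # no retinopathy hits

toy_checks <- toy_notes %>%
  transmute(NOTE_ID,
            DIABETIC_NEUROPATHY  = as.integer(NOTE_ID %in% toy_neuro$NOTE_ID),
            DIABETIC_NEPHROPATHY = as.integer(NOTE_ID %in% toy_nephro$NOTE_ID),
            DIABETIC_RETINOPATHY = as.integer(NOTE_ID %in% toy_retino$NOTE_ID),
            # 1 if any of the three indicators is 1
            ANY_DIABETIC_COMPLICATION = pmax(DIABETIC_NEUROPATHY,
                                             DIABETIC_NEPHROPATHY,
                                             DIABETIC_RETINOPATHY))

toy_checks
```

This keeps the column names attached from the start, so the separate colnames() step is not needed in this variant.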

checks %<>%
  mutate(ANY_MATCH = ifelse(goldstandard$ANY_DIABETIC_COMPLICATION == ANY_DIABETIC_COMPLICATION, 1, 0),
         NEUROPATHY_MATCH = ifelse(goldstandard$DIABETIC_NEUROPATHY == DIABETIC_NEUROPATHY, 1, 0),
         NEPHROPATHY_MATCH = ifelse(goldstandard$DIABETIC_NEPHROPATHY == DIABETIC_NEPHROPATHY, 1, 0),
         RETINOPATHY_MATCH = ifelse(goldstandard$DIABETIC_RETINOPATHY == DIABETIC_RETINOPATHY, 1, 0))
checks_nomatch <- checks %>%
  filter(ANY_MATCH == 0)
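
The comparison above lines up checks and goldstandard by row position, which is only safe if both tables list the notes in the same order. One possible safeguard is to join on NOTE_ID first. A sketch with toy tables (names and values are made up):

```r
library(dplyr)

toy_pred <- tibble(NOTE_ID = c(1L, 2L, 3L),
                   ANY_DIABETIC_COMPLICATION = c(1L, 0L, 1L))

# Gold standard deliberately stored in a different row order
toy_gold <- tibble(NOTE_ID = c(3L, 1L, 2L),
                   ANY_DIABETIC_COMPLICATION = c(0L, 1L, 0L))

toy_compare <- toy_pred %>%
  inner_join(toy_gold, by = "NOTE_ID", suffix = c("_PRED", "_GOLD")) %>%
  mutate(ANY_MATCH = as.integer(ANY_DIABETIC_COMPLICATION_PRED ==
                                  ANY_DIABETIC_COMPLICATION_GOLD))

toy_compare
```

Here notes 1 and 2 agree with the toy gold standard and note 3 does not, regardless of how the gold standard rows are ordered.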

print(checks_nomatch[1:14, 1:9])

##    NOTE_ID ANY_DIABETIC_COMPLICATION DIABETIC_NEUROPATHY DIABETIC_NEPHROPATHY
## 1        3                         1                   1                    0
## 2       14                         0                   0                    0
## 3       15                         1                   0                    1
## 4       16                         0                   0                    0
## 5       21                         1                   1                    0
## 6       35                         1                   0                    1
## 7       55                         1                   0                    1
## 8       69                         1                   0                    1
## 9       82                         1                   1                    0
## 10      85                         0                   0                    0
## 11     109                         1                   0                    1
## 12     120                         1                   0                    1
## 13     123                         1                   0                    1
## 14     135                         0                   0                    0
##    DIABETIC_RETINOPATHY ANY_MATCH NEUROPATHY_MATCH NEPHROPATHY_MATCH
## 1                     0         0                0                 1
## 2                     0         0                0                 1
## 3                     0         0                1                 0
## 4                     0         0                1                 1
## 5                     0         0                0                 1
## 6                     0         0                1                 0
## 7                     0         0                1                 0
## 8                     0         0                1                 0
## 9                     0         0                0                 1
## 10                    0         0                1                 0
## 11                    0         0                1                 0
## 12                    0         0                1                 0
## 13                    0         0                1                 0
## 14                    0         0                1                 1
##    RETINOPATHY_MATCH
## 1                  1
## 2                  1
## 3                  1
## 4                  0
## 5                  1
## 6                  1
## 7                  1
## 8                  1
## 9                  1
## 10                 1
## 11                 1
## 12                 1
## 13                 1
## 14                 0

Overall, my algorithm identified 33 unique notes with diabetic complications. After hand review of the notes, I know for sure that 30/33 were correctly classified and 3/33 were incorrectly classified as “cases”.

Of the 141 total notes, 14 were misclassified when compared with the gold standard.

NOTE_ID 3: I identified as having neuropathy but it did not
NOTE_ID 14: I failed to identify as having neuropathy
NOTE_ID 15: I identified as having nephropathy but it did not
NOTE_ID 16: I failed to identify as having retinopathy
NOTE_ID 21: I identified as having neuropathy but it did not
NOTE_ID 35: I identified as having nephropathy but it did not
NOTE_ID 55: I identified as having nephropathy but it did not
NOTE_ID 69: I identified as having nephropathy but it did not
NOTE_ID 82: I identified as having neuropathy but it did not
NOTE_ID 85: I failed to identify as having nephropathy
NOTE_ID 109: I identified as having nephropathy but it did not
NOTE_ID 120: I identified as having nephropathy but it did not
NOTE_ID 123: I identified as having nephropathy but it did not
NOTE_ID 135: I failed to identify as having retinopathy

Reflection

Overall, I identified 33 notes with diabetic complications. Examining these results by hand, I found that 30 of the 33 were correctly identified. After comparing against the gold standard, I found that 14 notes were misclassified:

NOTE_ID 3: I identified as having neuropathy but it did not
NOTE_ID 14: I failed to identify as having neuropathy
NOTE_ID 15: I identified as having nephropathy but it did not
NOTE_ID 16: I failed to identify as having retinopathy
NOTE_ID 21: I identified as having neuropathy but it did not
NOTE_ID 35: I identified as having nephropathy but it did not
NOTE_ID 55: I identified as having nephropathy but it did not
NOTE_ID 69: I identified as having nephropathy but it did not
NOTE_ID 82: I identified as having neuropathy but it did not
NOTE_ID 85: I failed to identify as having nephropathy
NOTE_ID 109: I identified as having nephropathy but it did not
NOTE_ID 120: I identified as having nephropathy but it did not
NOTE_ID 123: I identified as having nephropathy but it did not
NOTE_ID 135: I failed to identify as having retinopathy

In total, 10 notes were incorrectly identified as cases (false positives) and 4 notes were missed (false negatives).
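
These counts can be turned into standard summary measures. The sketch below assumes that the figures reported above (141 notes, 33 flagged, 10 false positives, 4 false negatives) all refer to the note-level ANY_DIABETIC_COMPLICATION flag. That reading is an assumption, and the measures are only as reliable as it is.

```r
# Counts taken from the text above; treating them as note-level counts for
# ANY_DIABETIC_COMPLICATION is an assumption
n_notes   <- 141
n_flagged <- 33
false_pos <- 10
false_neg <- 4

# Remaining cells of the 2x2 table follow by subtraction
true_pos <- n_flagged - false_pos
true_neg <- n_notes - n_flagged - false_neg

sensitivity <- true_pos / (true_pos + false_neg)  # share of true cases found
specificity <- true_neg / (true_neg + false_pos)  # share of non-cases left alone
ppv         <- true_pos / (true_pos + false_pos)  # share of flagged notes that are cases

round(c(sensitivity = sensitivity, specificity = specificity, ppv = ppv), 3)
```

Reporting sensitivity and positive predictive value side by side makes it easier to see that most of the errors are false positives rather than missed cases.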

Example 1: A correctly classified note

We will look at NOTE_ID 7 which is an operative note.

all_complications %>%
  slice(3) %>% nlp_example_datatable()

This note was identified through the keyword “neuropathy”. It was not excluded for negation or for referring to another individual. This individual’s medical record clearly states a “history of diabetic neuropathy”.

Example 2: an incorrectly identified note.

We will look at note 3, which I incorrectly identified as having diabetic neuropathy.

all_complications %>% 
  slice(1) %>% nlp_example_datatable()

Note 3 (shown above) was also identified in searching for neuropathy, but the information about its cause occurs far outside of the window and would likely only be caught by hand review after reading the entire note. This case of neuropathy is due to a car accident, not diabetes.

An alternative approach to correcting note 109 would be to first exclude any notes that contain the text “No history of diabetes” in any section except family history. This would remove patients who do not have diabetes up front, so that the individuals identified through the remaining notes at least have diabetes. It would still require checking that the windows identified in the keyword window search are about the individual in question and are not negated, but it could be helpful in removing known non-diabetics.
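
A rough sketch of that pre-filter on made-up notes. It checks the whole TEXT for the phrase and, unlike the idea described above, does not try to skip the family history section, so it is only a starting point:

```r
library(dplyr)
library(stringr)

# Made-up notes, not from the course data
toy_notes <- tibble(
  NOTE_ID = 1:3,
  TEXT = c("PAST MEDICAL HISTORY: No history of diabetes. Chronic renal failure.",
           "PAST MEDICAL HISTORY: Type 2 diabetes with nephropathy.",
           "FAMILY HISTORY: Father had diabetes.")
)

# Drop notes that explicitly state the patient has no history of diabetes
likely_diabetic <- toy_notes %>%
  filter(!str_detect(TEXT, regex("no history of diabetes", ignore_case = TRUE)))

likely_diabetic$NOTE_ID
```

Note 3 survives the filter even though it only mentions a family member, which shows why the keyword windows would still need subject and negation checks afterwards.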

I also noted that a lot of my misclassifications were in nephropathy. I believe it is because I identified end-stage renal failure, which is not the same as diabetic nephropathy. This was a mistake that would have been better avoided if I had more clinical knowledge or a clinical expert to consult with. Removing this keyword would probably reduce my false positive rate.

MiniProject3_Stoneman

Hayley Stoneman

4/11/2021
