Last updated: 2025-10-24

Checks: 7 passed, 0 failed

Knit directory: genomics_ancest_disease_dispar/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.

The command set.seed(20220216) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
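As a quick illustration (not part of the analysis itself), rerunning a random draw under the same seed reproduces it exactly:

```r
# Identical seeds yield identical draws
set.seed(20220216)
x1 <- sample(1:100, 3)

set.seed(20220216)
x2 <- sample(1:100, 3)

identical(x1, x2)
# [1] TRUE
```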

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 0d8b872. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
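For example, wflow_publish can commit this page's source Rmd and rebuild its HTML in one step (the commit message below is illustrative):

```r
library(workflowr)

# Commit the source Rmd, rebuild the page, and commit the results together
wflow_publish("analysis/text_for_cohort_labels.Rmd",
              message = "Update cohort label extraction")
```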


Ignored files:
    Ignored:    .DS_Store
    Ignored:    .Rproj.user/
    Ignored:    IHME_GBD_2019_Level_2.csv
    Ignored:    analysis/.DS_Store
    Ignored:    ancestry_dispar_env/
    Ignored:    author_manuscript.filelist.txt
    Ignored:    author_manuscript_files.txt
    Ignored:    data/.DS_Store
    Ignored:    data/cohort/
    Ignored:    data/gbd/.DS_Store
    Ignored:    data/gbd/IHME-GBD_2021_DATA-d8cf695e-1.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2019_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_disease_burden_rate_all_ages.csv
    Ignored:    data/gbd/ihme_gbd_2021_global_paf_rate_percent_all_ages.csv
    Ignored:    data/gwas_catalog/
    Ignored:    data/icd/.DS_Store
    Ignored:    data/icd/IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2019_NONFATAL_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/icd/IHME_GBD_2021_COD_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/IHME_GBD_2021_NONFATAL_CAUSE_ICD_CODE_MAP_Y2024M05D16.XLSX
    Ignored:    data/icd/UK_Biobank_master_file.tsv
    Ignored:    data/icd/cdc_valid_icd10_Sep_23_2025.xlsx
    Ignored:    data/icd/cdc_valid_icd9_Sep_23_2025.xlsx
    Ignored:    data/icd/manual_disease_icd10_mappings.xlsx
    Ignored:    data/icd/phecode_international_version_unrolled.csv
    Ignored:    data/icd/semiautomatic_ICD-pheno.txt
    Ignored:    data/icd/~$IHME_GBD_2019_COD_CAUSE_ICD_CODE_MAP_Y2020M10D15.XLSX
    Ignored:    data/who/
    Ignored:    fulltexts_pmc/
    Ignored:    output/.DS_Store
    Ignored:    output/abstracts/
    Ignored:    output/doccano/
    Ignored:    output/fulltexts/
    Ignored:    output/gwas_cat/
    Ignored:    output/gwas_cohorts/
    Ignored:    output/icd_map/
    Ignored:    output/trait_ontology/
    Ignored:    pubmedbert-cohort-ner-model/
    Ignored:    pubmedbert-cohort-ner/
    Ignored:    r-spacyr/
    Ignored:    renv/
    Ignored:    venv/

Untracked files:
    Untracked:  analysis/get_full_text.Rmd
    Untracked:  analysis/specific_aims_stats.Rmd
    Untracked:  code/full_text_download.R
    Untracked:  code/get_dbgap_ids.py
    Untracked:  code/get_pmids_from_dbgap.py
    Untracked:  test_cohort_info_overlap.R

Unstaged changes:
    Modified:   .gitignore
    Modified:   analysis/correcting_cohort_names.Rmd
    Modified:   analysis/disease_inves_by_ancest.Rmd
    Modified:   analysis/gbd_data_plots.Rmd
    Modified:   analysis/group_cancer_diseases.Rmd
    Modified:   analysis/gwas_to_gbd.Rmd
    Modified:   analysis/index.Rmd
    Modified:   analysis/level_1_disease_group_non_cancer.Rmd
    Modified:   analysis/level_2_disease_group.Rmd
    Modified:   analysis/map_trait_to_icd10.Rmd
    Modified:   analysis/trait_ontology_categorization.Rmd
    Deleted:    code/abstracts.R
    Modified:   code/pubmedbert_train_test.py

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/text_for_cohort_labels.Rmd) and HTML (docs/text_for_cohort_labels.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 0d8b872 IJbeasley 2025-10-24 Cleaning up abstract collecting code again
html 2afc108 IJbeasley 2025-10-24 Build site.
Rmd 748dac2 IJbeasley 2025-10-24 Cleaning up abstract collecting

Set up

Required packages

library(stringr)
library(readxl)
library(dplyr)
library(stringi)
library(rentrez)
library(xml2)
library(jsonlite)

# Improve sentence recognition
# library(reticulate)

# Path to the Python inside your venv
# python_path <- file.path(here::here(), "venv", "bin", "python")  # Mac/Linux
# Sys.setenv(SPACY_PYTHON = python_path)
# 
# use_python(Sys.getenv("SPACY_PYTHON", 
#                                       unset = "r-spacyr"), required = TRUE)
# 
# # Check configuration
# py_config()
# 
# py_module_available("spacy")
# 
# library(spacyr)
# spacy_initialize(model = "en_core_web_sm")
# library(spacyr)
# reticulate::virtualenv_create("r-spacyr", python = python_exe)
# spacy_install(version = "apple")
# spacy_download_langmodel("en_core_web_sm")
library(tokenizers)

Getting required information from the GWAS Catalog

Step 1: Get only disease studies and select relevant columns

## Step 1: 
# get only disease studies
gwas_study_info <- data.table::fread(here::here("output/gwas_cat/gwas_study_info_trait_group_l2.csv"))

gwas_study_info = gwas_study_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

gwas_study_info =
  gwas_study_info |>
  dplyr::filter(DISEASE_STUDY == TRUE) |>
  dplyr::select(-COHORT)

Step 2: Get cohort information

gwas_study_info_cohort =
  data.table::fread(here::here("output/gwas_cohorts/gwas_cohort_name_corrected.csv"))

gwas_study_info_cohort =
  gwas_study_info_cohort |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

gwas_study_info_cohort =
  gwas_study_info_cohort |>
  select(STUDY_ACCESSION,
         COHORT) |>
  distinct()

gwas_study_info =
  left_join(gwas_study_info,
            gwas_study_info_cohort,
            by = "STUDY_ACCESSION"
  )

Step 3: Add ancestry/population info

gwas_ancest_info <-  data.table::fread(here::here("data/gwas_catalog/gwas-catalog-v1.0.3.1-ancestries-r2025-07-21.tsv"),
                           sep = "\t",
                           quote = "")

gwas_ancest_info = gwas_ancest_info |>
  dplyr::rename_with(~ gsub(" ", "_", .x))

gwas_ancest_info =
  gwas_ancest_info |>
  select(STUDY_ACCESSION,
         BROAD_ANCESTRAL_CATEGORY,
         COUNTRY_OF_RECRUITMENT) |>
  distinct() |>
  group_by(STUDY_ACCESSION) |>
  summarise(
    BROAD_ANCESTRAL_CATEGORY = paste(
      unique(
        unlist(strsplit(BROAD_ANCESTRAL_CATEGORY, split = "\\|"))
      ),
      collapse = "|"
    ),
    COUNTRY_OF_RECRUITMENT = paste(
      unique(
        unlist(strsplit(COUNTRY_OF_RECRUITMENT, split = "\\|"))
      ),
      collapse = "|"
    )
  )

gwas_study_info =
  left_join(gwas_study_info,
            gwas_ancest_info,
            by = "STUDY_ACCESSION"
  )
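The strsplit/unique/paste pattern above deduplicates pipe-delimited values within each study before collapsing them back into a single string. A minimal sketch on hypothetical toy rows:

```r
library(dplyr)

# Hypothetical toy rows illustrating the collapse step
toy <- data.frame(
  STUDY_ACCESSION = c("GCST1", "GCST1", "GCST2"),
  BROAD_ANCESTRAL_CATEGORY = c("European|East Asian", "European", "African")
)

toy |>
  group_by(STUDY_ACCESSION) |>
  summarise(
    BROAD_ANCESTRAL_CATEGORY = paste(
      unique(unlist(strsplit(BROAD_ANCESTRAL_CATEGORY, split = "\\|"))),
      collapse = "|"
    )
  )
# GCST1 -> "European|East Asian"; GCST2 -> "African"
```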

Step 4: Extract information required to get paper text & abstracts

gwas_study_info <-
  gwas_study_info |>
  #filter(COHORT != "") |>
  select(PUBMED_ID,
         COHORT,
         DATE,
         BROAD_ANCESTRAL_CATEGORY,
         COUNTRY_OF_RECRUITMENT) |>
  distinct() |>
  group_by(PUBMED_ID,
           DATE,
           BROAD_ANCESTRAL_CATEGORY,
           COUNTRY_OF_RECRUITMENT) |>
  summarise(
    COHORT = paste(
      unique(
        unlist(strsplit(COHORT, split = "\\|"))
      ),
      collapse = "|"
    )
  )
`summarise()` has grouped output by 'PUBMED_ID', 'DATE',
'BROAD_ANCESTRAL_CATEGORY'. You can override using the `.groups` argument.
pmids = gwas_study_info$PUBMED_ID
cohort = gwas_study_info$COHORT
date = gwas_study_info$DATE
country = gwas_study_info$COUNTRY_OF_RECRUITMENT
ancestry = gwas_study_info$BROAD_ANCESTRAL_CATEGORY

# how many papers without cohort information
gwas_study_info |> 
  filter(COHORT == "") |>
  nrow()
[1] 4006
# papers for which there is cohort information
gwas_study_info |>
  filter(COHORT != "") |>
  nrow()
[1] 1626

Just take a sample:

# Example PMIDs
rows_with_cohort = which(gwas_study_info$COHORT != "")
rows_without_cohort = which(gwas_study_info$COHORT == "")

rows = 
  c(rows_with_cohort, 
    sample(rows_without_cohort, length(rows_with_cohort))
    )

pmids = pmids[rows]

cohort = cohort[rows]
names(cohort) = pmids

date = date[rows]
names(date) = pmids

country = country[rows]
names(country) = pmids

ancestry = ancestry[rows]
names(ancestry) = pmids

Getting cohort names

# this xlsx was built from looking at acronyms / cohort names in the GWAS Catalog
# and finding the corresponding full names / details of cohorts

cohort_names <- readxl::read_xlsx(here::here("data/cohort/cohort_desc.xlsx"),
                                 sheet = 1)
New names:
• `` -> `...13`
cohort_full_names = cohort_names$full_name[!is.na(cohort_names$full_name)]
cohort_full_names <- str_trim(cohort_full_names)
cohort_full_names <- iconv(cohort_full_names, to = "UTF-8")
cohort_full_names <- gsub("[\u00A0\r\n]", " ", cohort_full_names)  # replace non-breaking spaces, CR, LF with space
cohort_full_names <- str_squish(cohort_full_names)  # trims and removes extra spaces
# sort by length of name (longest first) to match longest names first
cohort_full_names <- cohort_full_names[order(-nchar(cohort_full_names))]
cohort_full_names <- cohort_full_names[cohort_full_names != "Not Reported"]


cohort_abbr_names = cohort_names$cohort[!is.na(cohort_names$cohort)]
cohort_abbr_names <- str_trim(cohort_abbr_names)
cohort_abbr_names <- iconv(cohort_abbr_names, to = "UTF-8")
cohort_abbr_names <- gsub("[\u00A0\r\n]", " ", cohort_abbr_names)  # replace non-breaking spaces, CR, LF with space
cohort_abbr_names <- str_squish(cohort_abbr_names)  # trims and removes extra spaces
# remove abbreviations that are too short
cohort_abbr_names <- cohort_abbr_names[nchar(cohort_abbr_names) >= 4]
small_abbr_to_keep <- c("C4D", 
                        "BBJ", 
                        "UKB", 
                        "MVP", 
                        "TWB", 
                        "QBB",
                        "MEC"
                        )
cohort_abbr_names <- unique(c(cohort_abbr_names, 
                              small_abbr_to_keep
                              ))
# drop names containing a literal question mark (fixed = TRUE avoids an invalid regex)
cohort_abbr_names <- cohort_abbr_names[!grepl("?", cohort_abbr_names, fixed = TRUE)]
# sort by length of name (longest first) to match longest names first
cohort_abbr_names <- cohort_abbr_names[order(-nchar(cohort_abbr_names))]
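A minimal sketch of the same normalization steps applied to a hypothetical messy name (non-breaking space, line break, stray whitespace):

```r
library(stringr)

# Hypothetical messy cohort names
x <- c("  UK\u00A0Biobank\r\n Study ", "BBJ")

x <- str_trim(x)                    # trim leading/trailing whitespace
x <- iconv(x, to = "UTF-8")         # normalize encoding
x <- gsub("[\u00A0\r\n]", " ", x)   # NBSP, CR, LF -> space
x <- str_squish(x)                  # collapse internal runs of spaces

x
# [1] "UK Biobank Study" "BBJ"
```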

Relevant abstract text

Get abstracts from Entrez

set_entrez_key(Sys.getenv('NCBI_API_KEY'))

get_pubmed_abstracts <- function(pmids, 
                                 batch_size = 200, 
                                 verbose = TRUE) {
  
  n <- length(pmids)
  
  abstracts <- setNames(rep("MISSING", 
                            n), 
                        pmids
                        )  # initialize result
  
  # Split PMIDs into batches
  batches <- split(pmids, 
                   ceiling(seq_along(pmids)/batch_size)
                   )
  
  for(i in seq_along(batches)) {
    
    batch_pmids <- batches[[i]]
    
    if(verbose) message(sprintf("Fetching batch %d of %d (%d PMIDs)...", 
                                i, 
                                length(batches), 
                                length(batch_pmids)
                                )
                        )
    
    # Fetch XML
    xml_data <- entrez_fetch(db = "pubmed", 
                             id = paste(batch_pmids, 
                                        collapse = ","), 
                             rettype = "xml", 
                             parsed = FALSE
                             )
    
    doc <- read_xml(xml_data)
    articles <- xml_find_all(doc, ".//PubmedArticle")
    
    for(article in articles) {
      
      pmid_node <- xml_find_first(article, 
                                  ".//PMID")
      
      pmid <- xml_text(pmid_node)
      
      abstract_nodes <- xml_find_all(article, 
                                     ".//AbstractText")
      
      if(length(abstract_nodes) > 0) {
        abstracts[pmid] <- paste(xml_text(abstract_nodes), 
                                 collapse = " ")
      }
    }
  }
  
  return(abstracts)
}

pmids = unique(pmids)
abstracts <- get_pubmed_abstracts(pmids)

pmids <- pmids[abstracts != "MISSING"]
abstracts <- abstracts[abstracts != "MISSING"]
# index the named vectors by the retained PMIDs: their lengths can differ
# from abstracts after the earlier unique(pmids) call, so logical indexing
# by abstracts would misalign them
date <- date[as.character(pmids)]
cohort <- cohort[as.character(pmids)]

# Loop through abstracts and write each to a file
for (i in seq_along(abstracts)) {
  file_name <- file.path(here::here("output/abstracts"), paste0(pmids[i], ".txt"))
  writeLines(abstracts[i], file_name)
}

Check: What abstracts are missing?

abstract_files <- list.files(here::here("output/abstracts/"), 
                             pattern = "\\.txt$",  # regex, not a glob
                             full.names = FALSE
                             )

pmids_with_abstracts = gsub("\\.txt$", "", abstract_files)

all_pmids = gwas_study_info$PUBMED_ID |> unique()

missing_abstracts = setdiff(all_pmids,
                             pmids_with_abstracts)

print("Number of missing abstracts:")
[1] "Number of missing abstracts:"
length(missing_abstracts)
[1] 42
library(openalexR)

# get information on these papers from OpenAlex
oa_example <- 
oa_fetch(entity = "works",
         pmid = missing_abstracts,
         abstract = TRUE)

oa_example |>
  select(doi, abstract) |>
  distinct()
# A tibble: 41 × 2
   doi                                             abstract                     
   <chr>                                           <chr>                        
 1 https://doi.org/10.1038/s41431-020-0636-6        <NA>                        
 2 https://doi.org/10.1038/s41586-022-04826-7       <NA>                        
 3 https://doi.org/10.1038/mp.2009.107              <NA>                        
 4 https://doi.org/10.1016/j.jaci.2015.01.047       <NA>                        
 5 https://doi.org/10.1002/art.23603                <NA>                        
 6 https://doi.org/10.1016/j.jaci.2010.06.051       <NA>                        
 7 https://doi.org/10.1176/appi.ajp.2013.12091228   <NA>                        
 8 https://doi.org/10.1164/rccm.202206-1139le      "\"Identification of a Genet…
 9 https://doi.org/10.1161/circgenetics.111.960989  <NA>                        
10 https://doi.org/10.4088/jcp.15l10127            "Article Abstract Because th…
# ℹ 31 more rows

Read in abstracts from files

abstracts <- sapply(abstract_files,
                    function(file) {
                      readLines(here::here(paste0("output/abstracts/",file)), 
                                warn = FALSE) |> 
                        paste(collapse = " ")
                    }
                    )

pmids = pmids_with_abstracts 
cohort = cohort[pmids]
date = date[pmids]
country = country[pmids]

Extract sentences from abstracts with cohort names

extract_cohort_sentences <- function(text_vector, 
                                     cohort_names, 
                                     ignore_case) {
  
  
  results <- lapply(seq_along(text_vector), function(i) {
    
    abstract <- text_vector[i]
    # Split abstract into sentences
    sentences <- unlist(tokenize_sentences(abstract))

    # For each sentence, find all matching cohort names
    lapply(seq_along(sentences), function(s) {
      
      sentence <- sentences[s]
      
      # Identify cohort names present in this sentence
      matched_cohorts <- cohort_names[str_detect(sentence, 
                                                 regex(cohort_names, 
                                                       ignore_case = ignore_case))]

      data.frame(
        abstract_id = i,
        sentence_id = s,
        sentence = sentence,
        has_cohort = length(matched_cohorts) > 0,
        COHORT = if (length(matched_cohorts) > 0) str_flatten(unique(matched_cohorts), collapse = "|", na.rm = TRUE) else "",
        stringsAsFactors = FALSE
      )
    }) |> bind_rows()
  })

  bind_rows(results)
}
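Before running on the full set of abstracts, the core detection logic (sentence tokenization plus vectorized str_detect) can be sanity-checked on a toy abstract with hypothetical cohort names:

```r
library(stringr)
library(tokenizers)

toy_abstract <- "We analysed UK Biobank data. Replication used BioBank Japan."
toy_names <- c("UK Biobank", "BioBank Japan")

# Split into sentences, then find which names appear in each sentence
sentences <- unlist(tokenize_sentences(toy_abstract))
lapply(sentences, function(s) {
  toy_names[str_detect(s, regex(toy_names, ignore_case = TRUE))]
})
# [[1]] "UK Biobank"; [[2]] "BioBank Japan"
```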

cohort_sentences_df <- extract_cohort_sentences(abstracts,
                                                cohort_full_names,
                                                ignore_case = TRUE
                                                )

cohort_sentences_df_p2 = extract_cohort_sentences(abstracts,
                                                  cohort_abbr_names,
                                                  ignore_case = FALSE
                                                  )

cohort_sentences_df =
  bind_rows(cohort_sentences_df,
            cohort_sentences_df_p2
            )

cohort_sentences_df = 
  cohort_sentences_df |>
  distinct()

Correct for the over-representation of sentences without cohort names

# set the ratio of sentences with/without cohort names
# ratio = 1
# 
# # number of sentences with cohort names
# n_tp_sentences = nrow(cohort_sentences_df |>
#                       filter(COHORT != "")
#                       )
# cohort_sentences_df  =
#   cohort_sentences_df |>
#   mutate(has_cohort = ifelse(COHORT == "", FALSE, TRUE)) |>
#   group_by(has_cohort) |>
#   slice_sample(n = ratio*n_tp_sentences,
#                replace = FALSE
#                ) 

filtered_cohort_sentences_df =
  cohort_sentences_df |>
  filter(COHORT != "")

control_cohort_sentences_df <-
cohort_sentences_df |>
  filter(grepl("cohort|consortium|consortia|study|population|registry|biobank", 
               sentence, 
               ignore.case = TRUE)
         ) |>
  slice_sample(n = 1000)

filtered_cohort_sentences_df =
  bind_rows(filtered_cohort_sentences_df,
            control_cohort_sentences_df
            ) |>
  distinct()

filtered_cohort_sentences_df =
  filtered_cohort_sentences_df |>
  group_by(abstract_id, sentence_id, sentence) |>
  summarise(
    COHORT = str_flatten(unique(COHORT), collapse = "|", na.rm = TRUE)
  ) |>
  ungroup()
`summarise()` has grouped output by 'abstract_id', 'sentence_id'. You can
override using the `.groups` argument.
abstract_ids = filtered_cohort_sentences_df$abstract_id |> unique() |> sort()

pmids = pmids[abstract_ids]
abstracts = abstracts[abstract_ids]
date = date[abstract_ids]
cohort = cohort[abstract_ids]

Prepare data in Doccano JSON format

# file path for intermediate json file output
json_file = here::here("output/doccano/abstracts_with_cohort_info.json")

# file path for final jsonl file output
jsonl_file = here::here("output/doccano/abstracts_with_cohort_info.jsonl")

convert_to_doccano_json_sentence_level <- function(pmids,
                                                   date,
                                                   cohort,
                                                   country,
                                                   cohort_sentences_df) {
  
  # set up json list
  doccano_list <- list()
  example_id <- 1

  # iterate over unique sentences so duplicates are not processed twice
  for(current_sentence in unique(cohort_sentences_df$sentence)) {

      # Filter cohort sentences that match this sentence (safe matching)
      df <- cohort_sentences_df |>
             dplyr::filter(sentence == current_sentence) 
      
      # abstract_id <- df$abstract_id
      # matched_cohort <- df$COHORT
      
      for (i in seq_len(nrow(df))) {
        
        matched_cohort <- df$COHORT[i]
        abstract_id <- df$abstract_id[i]
        
      
        if(matched_cohort == ""){
        
          doccano_list[[example_id]] <- list(
          text = current_sentence,
          pubmed_id = pmids[abstract_id],
          date = date[abstract_id],
          country = country[abstract_id],
          gwas_cat_cohort_label = cohort[abstract_id],
          label = list()
          )
        
        } else {
        
        # Find the location of all matches of the cohort name in the sentence
        if(grepl("\\|", matched_cohort)) {
           
           # If multiple cohort names, separate
           matched_cohort <- unlist(
                                    strsplit(matched_cohort, 
                                             split = "\\|")
                                    )
        
           match_locations <- list()
        
           for(current_matched_cohort in matched_cohort){
          
               matches <- str_locate_all(current_sentence,
                                         fixed(current_matched_cohort, 
                                               ignore_case = TRUE)
                                         )[[1]]
          
               match_locations <- append(match_locations, 
                                         list(matches)
                                         )
          
           }
        
           # Combine all match locations into a single matrix
           matches <- do.call(rbind, match_locations)
        
        }  else {
        
           matches <- str_locate_all(current_sentence,
                                     fixed(matched_cohort, 
                                           ignore_case = TRUE)
                                     )[[1]]
      
      }
      
      # Convert matches to 0-based indexing (for doccano)
      matches[, "start"] <- matches[, "start"] - 1  
      
      # Turn match locations into entity list
      entities <- list()
      
      for(k in seq_len(nrow(matches))) {
              entities <- append(entities, list(list(
                start_offset = matches[k, "start"],
                end_offset = matches[k, "end"],
                label = "COHORT"
              )))
      }
      
      # Create Doccano JSON entry
      doccano_list[[example_id]] <- list(
        # id = example_id,
        text = current_sentence,
        pubmed_id = pmids[abstract_id],
        date = date[abstract_id],
        country = country[abstract_id],
        gwas_cat_cohort_label = cohort[abstract_id],
        label = entities
      )
      
      }
      
        # increment inside the row loop so each entry gets its own slot
        # (incrementing once per sentence would overwrite earlier entries)
        example_id <- example_id + 1
      
      }
  }
  
  
# Convert each entry's 'label' from named lists to [start, end, label] triples
  doccano_list <- lapply(doccano_list, function(x) {
  x$label <- lapply(x$label, function(l) {
    # convert named list to vector/list format [start, end, label]
    c(l$start_offset, l$end_offset, l$label)
  })
  x
})

  return(doccano_list)
}
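The offset conversion inside the function maps stringr's 1-based, inclusive match positions to doccano's 0-based, end-exclusive spans: subtracting 1 from the start gives the 0-based start, while the 1-based inclusive end already equals the 0-based exclusive end. A small standalone check:

```r
library(stringr)

s <- "The UK Biobank cohort"
m <- str_locate_all(s, fixed("UK Biobank"))[[1]]
m
#      start end
# [1,]     5  14

start_offset <- m[1, "start"] - 1  # 0-based start: 4
end_offset   <- m[1, "end"]        # exclusive end: 14 (kept as-is)

substr(s, m[1, "start"], m[1, "end"])
# [1] "UK Biobank"
```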




# Create JSON as a list
json_list <- convert_to_doccano_json_sentence_level(pmids = pmids,
                                                    date = date,
                                                    cohort = cohort,
                                                    country = country,
                                                    filtered_cohort_sentences_df)



# Write to JSON file
writeLines(toJSON(json_list,
                  auto_unbox = TRUE,
                  pretty = TRUE),
           json_file)


json_data <- fromJSON(json_file,
                      simplifyVector = FALSE)

# Open connection to JSONL file
con <- file(jsonl_file, "w")

# Loop over each element (object) and write as one line
for (i in seq_along(json_data)) {
  writeLines(toJSON(json_data[[i]], auto_unbox = TRUE), con)
}

# Close connection
close(con)

cat("JSONL saved to:", jsonl_file, "\n")
JSONL saved to: /Users/ibeasley/code/genomics_ancest_disease_dispar/output/doccano/abstracts_with_cohort_info.jsonl 

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] openalexR_2.0.1  tokenizers_0.3.0 jsonlite_2.0.0   xml2_1.4.0      
 [5] rentrez_1.2.4    stringi_1.8.7    dplyr_1.1.4      readxl_1.4.5    
 [9] stringr_1.5.2    workflowr_1.7.1 

loaded via a namespace (and not attached):
 [1] compiler_4.3.1    renv_1.0.3        promises_1.3.3    tidyselect_1.2.1 
 [5] Rcpp_1.1.0        git2r_0.36.2      callr_3.7.6       later_1.4.4      
 [9] jquerylib_0.1.4   yaml_2.3.10       fastmap_1.2.0     here_1.0.1       
[13] R6_2.6.1          SnowballC_0.7.1   generics_0.1.4    curl_7.0.0       
[17] knitr_1.50        XML_3.99-0.19     tibble_3.3.0      rprojroot_2.1.0  
[21] bslib_0.9.0       pillar_1.11.1     rlang_1.1.6       utf8_1.2.6       
[25] cachem_1.1.0      httpuv_1.6.16     xfun_0.53         getPass_0.2-4    
[29] fs_1.6.6          sass_0.4.10       cli_3.6.5         withr_3.0.2      
[33] magrittr_2.0.4    ps_1.9.1          digest_0.6.37     processx_3.8.6   
[37] rstudioapi_0.17.1 lifecycle_1.0.4   vctrs_0.6.5       data.table_1.17.8
[41] evaluate_1.0.5    glue_1.8.0        whisker_0.4.1     cellranger_1.1.0 
[45] rmarkdown_2.30    httr_1.4.7        tools_4.3.1       pkgconfig_2.0.3  
[49] htmltools_0.5.8.1