Last updated: 2023-11-01


Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1).



The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.


The results in this page were generated with repository version 3fc037e. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.



These are the previous versions of the repository in which changes were made to the R Markdown (analysis/gdc.Rmd) and HTML (docs/gdc.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File  Version  Author     Date        Message
Rmd   3fc037e  Dave Tang  2023-11-01  Using the GenomicDataCommons package

Introduction

About the GDC:

The National Cancer Institute’s (NCI’s) Genomic Data Commons (GDC) is a data sharing platform that promotes precision medicine in oncology. It is not just a database or a tool; it is an expandable knowledge network supporting the import and standardisation of genomic and clinical data from cancer research programs. The GDC contains NCI-generated data from some of the largest and most comprehensive cancer genomic datasets, including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Therapies (TARGET). For the first time, these datasets have been harmonised using a common set of bioinformatics pipelines, so that the data can be directly compared. As a growing knowledge system for cancer, the GDC also enables researchers to submit data, and harmonises these data for import into the GDC. As more researchers add clinical and genomic data to the GDC, it will become an even more powerful tool for making discoveries about the molecular basis of cancer that may lead to better care for patients.

The GenomicDataCommons Bioconductor package provides basic infrastructure for querying, accessing, and mining genomic datasets available from the GDC.

See The GDC API page.

Installation

Install the GenomicDataCommons package using BiocManager.

if (! "GenomicDataCommons" %in% installed.packages()[, 1]){
  BiocManager::install("GenomicDataCommons")
}
library(GenomicDataCommons)
packageVersion("GenomicDataCommons")
[1] '1.26.0'

Getting started

Check status to see if we can query the GDC.

GenomicDataCommons::status()
$commit
[1] "023da73eee3c17608db1a9903c82852428327b88"

$data_release
[1] "Data Release 38.0 - August 31, 2023"

$status
[1] "OK"

$tag
[1] "5.0.6"

$version
[1] 1
stopifnot(GenomicDataCommons::status()$status=="OK")

The following code builds a manifest that can be used to guide the download of raw data. Here, filtering finds open gene expression files quantified as raw counts using STAR from TCGA ovarian cancer patients.

# Build a manifest of open TCGA-OV gene expression files quantified by STAR.
ge_manifest <- files() %>%
  filter(cases.project.project_id == 'TCGA-OV') %>%
  filter(type == 'gene_expression') %>%
  filter(access == 'open') %>%
  filter(analysis.workflow_type == 'STAR - Counts') %>%
  manifest()

DT::datatable(ge_manifest)

The gdcdata function is used to download GDC files.

fnames <- lapply(ge_manifest$id[1:3], gdcdata)
fnames
[[1]]
                                                                                                          96aca0af-a776-460d-95ff-87e364e4ac99 
"~/.cache/GenomicDataCommons/96aca0af-a776-460d-95ff-87e364e4ac99/21ff9928-00f0-4b96-8d70-35e9bfad5d40.rna_seq.augmented_star_gene_counts.tsv" 

[[2]]
                                                                                                          b668c86b-fa56-4d39-9529-5b47081a3faa 
"~/.cache/GenomicDataCommons/b668c86b-fa56-4d39-9529-5b47081a3faa/41bdbd88-b4b2-4884-8a44-b34656ae4156.rna_seq.augmented_star_gene_counts.tsv" 

[[3]]
                                                                                                          60678f17-e3d7-40cd-99ff-73706497968a 
"~/.cache/GenomicDataCommons/60678f17-e3d7-40cd-99ff-73706497968a/03c8e4fe-1e07-4ea3-a154-c17c2e8af508.rna_seq.augmented_star_gene_counts.tsv" 

Files are downloaded and stored in the directory specified by gdc_cache().

gdc_cache()
[1] "~/.cache/GenomicDataCommons"
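
A downloaded file can then be read into R. The following is a minimal sketch that assumes the layout usually seen in augmented_star_gene_counts.tsv files: a leading # comment line, a header row, and summary rows whose gene_id starts with N_. The helper name read_star_counts and the toy file contents are made up for illustration, so check a real file before relying on it.

```r
# Hypothetical helper: read a STAR gene counts table and drop summary rows.
read_star_counts <- function(path) {
  # Lines starting with '#' (such as a gene model comment) are skipped.
  counts <- read.delim(path, comment.char = "#", stringsAsFactors = FALSE)
  # Rows such as N_unmapped are alignment summaries, not genes.
  counts[!grepl("^N_", counts$gene_id), , drop = FALSE]
}

# Toy file mimicking the assumed layout (all values are invented).
toy_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "# gene-model: GENCODE v36",
  "gene_id\tgene_name\tgene_type\tunstranded",
  "N_unmapped\t\t\t100",
  "N_multimapping\t\t\t200",
  "ENSG00000000003.15\tTSPAN6\tprotein_coding\t1500",
  "ENSG00000000005.6\tTNMD\tprotein_coding\t3"
), toy_file)

toy_counts <- read_star_counts(toy_file)
toy_counts[, c("gene_id", "unstranded")]
```

With real downloads, passing one of the paths stored in fnames to the helper would be the analogous call, provided the file follows the assumed layout.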

Tally the total number of STAR gene count files that are open for download.

open_star_manifest <- files() %>%
    filter(analysis.workflow_type == 'STAR - Counts') %>%
    filter(access == 'open') %>%
    manifest()

dim(open_star_manifest)
[1] 23111    16

Metadata queries

Queries in the GenomicDataCommons package follow the four metadata endpoints available at the GDC; there are four convenience functions that each create GDCQuery objects:

  1. projects()
  2. cases()
  3. files()
  4. annotations()

Each of the four endpoints (projects, cases, files, and annotations) has various associated fields. These are the default fields.

endpoints <- c("projects", "cases", "files", "annotations")
sapply(endpoints, default_fields)
$projects
 [1] "dbgap_accession_number" "disease_type"           "intended_release_date" 
 [4] "name"                   "primary_site"           "project_autocomplete"  
 [7] "project_id"             "releasable"             "released"              
[10] "state"                 

$cases
 [1] "aliquot_ids"              "analyte_ids"             
 [3] "case_autocomplete"        "case_id"                 
 [5] "consent_type"             "created_datetime"        
 [7] "days_to_consent"          "days_to_lost_to_followup"
 [9] "diagnosis_ids"            "disease_type"            
[11] "index_date"               "lost_to_followup"        
[13] "portion_ids"              "primary_site"            
[15] "sample_ids"               "slide_ids"               
[17] "state"                    "submitter_aliquot_ids"   
[19] "submitter_analyte_ids"    "submitter_diagnosis_ids" 
[21] "submitter_id"             "submitter_portion_ids"   
[23] "submitter_sample_ids"     "submitter_slide_ids"     
[25] "updated_datetime"        

$files
 [1] "access"                         "acl"                           
 [3] "average_base_quality"           "average_insert_size"           
 [5] "average_read_length"            "channel"                       
 [7] "chip_id"                        "chip_position"                 
 [9] "contamination"                  "contamination_error"           
[11] "created_datetime"               "data_category"                 
[13] "data_format"                    "data_type"                     
[15] "error_type"                     "experimental_strategy"         
[17] "file_autocomplete"              "file_id"                       
[19] "file_name"                      "file_size"                     
[21] "imaging_date"                   "magnification"                 
[23] "md5sum"                         "mean_coverage"                 
[25] "msi_score"                      "msi_status"                    
[27] "pairs_on_diff_chr"              "plate_name"                    
[29] "plate_well"                     "platform"                      
[31] "proc_internal"                  "proportion_base_mismatch"      
[33] "proportion_coverage_10x"        "proportion_coverage_10X"       
[35] "proportion_coverage_30x"        "proportion_coverage_30X"       
[37] "proportion_reads_duplicated"    "proportion_reads_mapped"       
[39] "proportion_targets_no_coverage" "read_pair_number"              
[41] "revision"                       "stain_type"                    
[43] "state"                          "state_comment"                 
[45] "submitter_id"                   "tags"                          
[47] "total_reads"                    "tumor_ploidy"                  
[49] "tumor_purity"                   "type"                          
[51] "updated_datetime"               "wgs_coverage"                  

$annotations
 [1] "annotation_autocomplete" "annotation_id"          
 [3] "case_id"                 "case_submitter_id"      
 [5] "category"                "classification"         
 [7] "created_datetime"        "entity_id"              
 [9] "entity_submitter_id"     "entity_type"            
[11] "legacy_created_datetime" "legacy_updated_datetime"
[13] "notes"                   "state"                  
[15] "status"                  "submitter_id"           
[17] "updated_datetime"       

Available fields for each endpoint.

all_fields <- sapply(endpoints, available_fields)
names(all_fields) <- endpoints

sapply(all_fields, length)
   projects       cases       files annotations 
         22        1001        1022          30 

These fields can be used for filtering purposes.

head(all_fields$files)
[1] "access"                      "acl"                        
[3] "analysis.analysis_id"        "analysis.analysis_type"     
[5] "analysis.created_datetime"   "analysis.input_files.access"

Use the facet function to aggregate on values used for a particular field.

files() %>% facet("access") %>% aggregations()
$access
  doc_count        key
1    678416 controlled
2    325331       open
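
The aggregation prints like an ordinary data frame, so counts can be turned into proportions with base R. The sketch below re-creates the table above by hand instead of querying the GDC; the column name fraction is introduced here for illustration.

```r
# Counts copied from the facet("access") output above.
access_counts <- data.frame(
  doc_count = c(678416, 325331),
  key = c("controlled", "open")
)

# Share of all files in each access category.
access_counts$fraction <- access_counts$doc_count / sum(access_counts$doc_count)
access_counts
```

On these numbers, roughly a third of the files are open access.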

Use grep to search for fields of interest, for example “project”.

grep("project", all_fields$files, ignore.case = TRUE, value = TRUE)
 [1] "cases.project.dbgap_accession_number"        
 [2] "cases.project.disease_type"                  
 [3] "cases.project.intended_release_date"         
 [4] "cases.project.name"                          
 [5] "cases.project.primary_site"                  
 [6] "cases.project.program.dbgap_accession_number"
 [7] "cases.project.program.name"                  
 [8] "cases.project.program.program_id"            
 [9] "cases.project.project_id"                    
[10] "cases.project.releasable"                    
[11] "cases.project.released"                      
[12] "cases.project.state"                         
[13] "cases.tissue_source_site.project"            

Look for fields containing “collection”, such as “days_to_collection”.

grep("collection", all_fields$cases, ignore.case = TRUE, value = TRUE)
[1] "samples.days_to_collection"     "samples.tissue_collection_type"

Look for “workflow_type”.

grep("workflow_type", all_fields$cases, ignore.case = TRUE, value = TRUE)
[1] "files.analysis.metadata.read_groups.read_group_qcs.workflow_type"
[2] "files.analysis.workflow_type"                                    
[3] "files.downstream_analyses.workflow_type"                         

Look for “treatment”.

grep("treatment", all_fields$cases, ignore.case = TRUE, value = TRUE)
 [1] "diagnoses.prior_treatment"                         
 [2] "diagnoses.treatments.chemo_concurrent_to_radiation"
 [3] "diagnoses.treatments.created_datetime"             
 [4] "diagnoses.treatments.days_to_treatment_end"        
 [5] "diagnoses.treatments.days_to_treatment_start"      
 [6] "diagnoses.treatments.initial_disease_status"       
 [7] "diagnoses.treatments.number_of_cycles"             
 [8] "diagnoses.treatments.reason_treatment_ended"       
 [9] "diagnoses.treatments.regimen_or_line_of_therapy"   
[10] "diagnoses.treatments.route_of_administration"      
[11] "diagnoses.treatments.state"                        
[12] "diagnoses.treatments.submitter_id"                 
[13] "diagnoses.treatments.therapeutic_agents"           
[14] "diagnoses.treatments.treatment_anatomic_site"      
[15] "diagnoses.treatments.treatment_arm"                
[16] "diagnoses.treatments.treatment_dose"               
[17] "diagnoses.treatments.treatment_dose_units"         
[18] "diagnoses.treatments.treatment_effect"             
[19] "diagnoses.treatments.treatment_effect_indicator"   
[20] "diagnoses.treatments.treatment_frequency"          
[21] "diagnoses.treatments.treatment_id"                 
[22] "diagnoses.treatments.treatment_intent_type"        
[23] "diagnoses.treatments.treatment_or_therapy"         
[24] "diagnoses.treatments.treatment_outcome"            
[25] "diagnoses.treatments.treatment_type"               
[26] "diagnoses.treatments.updated_datetime"             
[27] "follow_ups.diabetes_treatment_type"                
[28] "follow_ups.haart_treatment_indicator"              
[29] "follow_ups.immunosuppressive_treatment_type"       
[30] "follow_ups.reflux_treatment_type"                  
[31] "follow_ups.risk_factor_treatment"                  

Note that each field name above is made up of components separated by periods (.); this reflects the hierarchical structure of the fields. Summarise the top-level fields by using sub to keep only the text before the first period.

unique(sub("^(\\w+)\\..*", "\\1", all_fields$cases))
 [1] "aliquot_ids"              "analyte_ids"             
 [3] "annotations"              "case_autocomplete"       
 [5] "case_id"                  "consent_type"            
 [7] "created_datetime"         "days_to_consent"         
 [9] "days_to_lost_to_followup" "demographic"             
[11] "diagnoses"                "diagnosis_ids"           
[13] "disease_type"             "exposures"               
[15] "family_histories"         "files"                   
[17] "follow_ups"               "index_date"              
[19] "lost_to_followup"         "portion_ids"             
[21] "primary_site"             "project"                 
[23] "sample_ids"               "samples"                 
[25] "slide_ids"                "state"                   
[27] "submitter_aliquot_ids"    "submitter_analyte_ids"   
[29] "submitter_diagnosis_ids"  "submitter_id"            
[31] "submitter_portion_ids"    "submitter_sample_ids"    
[33] "submitter_slide_ids"      "summary"                 
[35] "tissue_source_site"       "updated_datetime"        
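
To see what the regular expression does without querying the GDC, here is a minimal sketch on a handful of field names copied from the output above. The vector toy_fields is a stand-in for all_fields$cases.

```r
# A few field names copied from the outputs above.
toy_fields <- c(
  "case_id",
  "diagnoses.prior_treatment",
  "diagnoses.treatments.treatment_type",
  "samples.days_to_collection",
  "files.analysis.workflow_type"
)

# Keep only the text before the first period; names without a period are unchanged.
top_level <- sub("^(\\w+)\\..*", "\\1", toy_fields)

# Count how many fields sit under each top-level name.
table(top_level)
```

Using table instead of unique also shows how many fields fall under each top-level name.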

Aggregations are computed on one field at a time; faceting on two fields returns one table per field rather than a cross-tabulation.

files() %>% facet(c("type", "data_format")) %>% aggregations()
$data_format
   doc_count               key
1     188265               tsv
2     184432               vcf
3     163225               maf
4     149745               bam
5     123119               txt
6      52733             bedpe
7      32898               svs
8      32708              idat
9      24236               cel
10     24002           bcr xml
11     11324               pdf
12     10755       bcr ssf xml
13      2884 bcr auxiliary xml
14      1051       bcr omf xml
15       805          cdc json
16       602        bcr biotab
17       568       bcr pps xml
18       215         jpeg 2000
19        74               mex
20        70              xlsx
21        36              hdf5

$type
   doc_count                           key
1     197177    annotated_somatic_mutation
2     149745                 aligned_reads
3      98319          structural_variation
4      94773       simple_somatic_mutation
5      71861           copy_number_segment
6      69806          copy_number_estimate
7      46580               gene_expression
8      34661   aggregated_somatic_mutation
9      34408              mirna_expression
10     33113                   slide_image
11     32708      masked_methylation_array
12     26978        biospecimen_supplement
13     24236    submitted_genotyping_array
14     23135     simple_germline_variation
15     16657       masked_somatic_mutation
16     16354        methylation_beta_value
17     13898           clinical_supplement
18     11324              pathology_report
19      7906            protein_expression
20       108 secondary_expression_analysis

Aggregate on a sub-field.

cases() %>% facet("diagnoses.treatments.treatment_type") %>% aggregations()
$diagnoses.treatments.treatment_type
   doc_count                                          key
1      12170                       radiation therapy, nos
2      11994                  pharmaceutical therapy, nos
3        470                                 chemotherapy
4        520        stem cell transplantation, autologous
5        299                                 surgery, nos
6        171                   targeted molecular therapy
7        168           immunotherapy (including vaccines)
8         96                     radiation, external beam
9         53                      brachytherapy, low dose
10        38                              hormone therapy
11        33                     brachytherapy, high dose
12        14        stem cell transplantation, allogeneic
13         9                   radiation, 2d conventional
14         7                      radiation, 3d conformal
15         6  radiation, intensity-modulated radiotherapy
16         5      radiation, stereotactic/gamma knife/srs
17         3                    stereotactic radiosurgery
18         1                     ablation, radiofrequency
19         1                      external beam radiation
20         1 peptide receptor radionuclide therapy (prrt)
21         1                       radiation, proton beam
22     76248                                     _missing

Facet open-access files on analysis.workflow_type.

files() %>%
  filter(access == 'open') %>%
  facet("analysis.workflow_type") %>%
  aggregations()
$analysis.workflow_type
   doc_count                                                  key
1      49062                   SeSAMe Methylation Beta Estimation
2      45258                                              DNAcopy
3      34408                                BCGSC miRNA Profiling
4      23164                                               ASCAT2
5      23111                                        STAR - Counts
6      21264                                               ASCAT3
7      16522 Aliquot Ensemble Somatic Variant Merging and Masking
8      10677                                    ABSOLUTE LiftOver
9       8776                                             AscatNGS
10       108                                Seurat - 10x Chromium
11        38                          CellRanger - 10x Raw Counts
12        36                     CellRanger - 10x Filtered Counts
13     92907                                             _missing

Facet open-access files on experimental_strategy.

files() %>%
  filter(access == 'open') %>%
  facet("experimental_strategy") %>%
  aggregations()
$experimental_strategy
   doc_count                         key
1     100363            Genotyping Array
2      49062           Methylation Array
3      34408                   miRNA-Seq
4      23111                     RNA-Seq
5      21348                Tissue Slide
6      16075                         WXS
7      11765            Diagnostic Slide
8       8776                         WGS
9       7906 Reverse Phase Protein Array
10       447         Targeted Sequencing
11       182                   scRNA-Seq
12     51888                    _missing

Files

All BAM files are under controlled access.

files() %>%
  filter(data_format == 'bam') %>%
  facet("access") %>%
  aggregations()
$access
  doc_count        key
1    149745 controlled

All VCF files are also under controlled access.

files() %>%
  filter(data_format == 'vcf') %>%
  facet("access") %>%
  aggregations()
$access
  doc_count        key
1    184432 controlled

Mutation Annotation Format (MAF) files are tab-delimited text files with aggregated mutation information from VCF files. Unlike VCF files, some MAF files are openly available; all open WXS files are in MAF format.

files() %>%
  filter(access == 'open') %>%
  filter(experimental_strategy == 'WXS') %>%
  facet("data_format") %>%
  aggregations()
$data_format
  doc_count key
1     16075 maf

Project

Project fields.

all_fields$projects
 [1] "dbgap_accession_number"                               
 [2] "disease_type"                                         
 [3] "intended_release_date"                                
 [4] "name"                                                 
 [5] "primary_site"                                         
 [6] "program.dbgap_accession_number"                       
 [7] "program.name"                                         
 [8] "program.program_id"                                   
 [9] "project_autocomplete"                                 
[10] "project_id"                                           
[11] "releasable"                                           
[12] "released"                                             
[13] "state"                                                
[14] "summary.case_count"                                   
[15] "summary.data_categories.case_count"                   
[16] "summary.data_categories.data_category"                
[17] "summary.data_categories.file_count"                   
[18] "summary.experimental_strategies.case_count"           
[19] "summary.experimental_strategies.experimental_strategy"
[20] "summary.experimental_strategies.file_count"           
[21] "summary.file_count"                                   
[22] "summary.file_size"                                    

Use projects to fetch project information and ids to list all available projects.

project_info <- projects() %>% results_all()

sort(ids(project_info))
 [1] "APOLLO-LUAD"               "BEATAML1.0-COHORT"        
 [3] "BEATAML1.0-CRENOLANIB"     "CDDP_EAGLE-1"             
 [5] "CGCI-BLGSP"                "CGCI-HTMCP-CC"            
 [7] "CGCI-HTMCP-DLBCL"          "CGCI-HTMCP-LC"            
 [9] "CMI-ASC"                   "CMI-MBC"                  
[11] "CMI-MPC"                   "CPTAC-2"                  
[13] "CPTAC-3"                   "CTSP-DLBCL1"              
[15] "EXCEPTIONAL_RESPONDERS-ER" "FM-AD"                    
[17] "GENIE-DFCI"                "GENIE-GRCC"               
[19] "GENIE-JHU"                 "GENIE-MDA"                
[21] "GENIE-MSK"                 "GENIE-NKI"                
[23] "GENIE-UHN"                 "GENIE-VICC"               
[25] "HCMI-CMDC"                 "MATCH-B"                  
[27] "MATCH-N"                   "MATCH-Q"                  
[29] "MATCH-Y"                   "MATCH-Z1D"                
[31] "MMRF-COMMPASS"             "MP2PRT-ALL"               
[33] "MP2PRT-WT"                 "NCICCR-DLBCL"             
[35] "OHSU-CNL"                  "ORGANOID-PANCREATIC"      
[37] "REBC-THYR"                 "TARGET-ALL-P1"            
[39] "TARGET-ALL-P2"             "TARGET-ALL-P3"            
[41] "TARGET-AML"                "TARGET-CCSK"              
[43] "TARGET-NBL"                "TARGET-OS"                
[45] "TARGET-RT"                 "TARGET-WT"                
[47] "TCGA-ACC"                  "TCGA-BLCA"                
[49] "TCGA-BRCA"                 "TCGA-CESC"                
[51] "TCGA-CHOL"                 "TCGA-COAD"                
[53] "TCGA-DLBC"                 "TCGA-ESCA"                
[55] "TCGA-GBM"                  "TCGA-HNSC"                
[57] "TCGA-KICH"                 "TCGA-KIRC"                
[59] "TCGA-KIRP"                 "TCGA-LAML"                
[61] "TCGA-LGG"                  "TCGA-LIHC"                
[63] "TCGA-LUAD"                 "TCGA-LUSC"                
[65] "TCGA-MESO"                 "TCGA-OV"                  
[67] "TCGA-PAAD"                 "TCGA-PCPG"                
[69] "TCGA-PRAD"                 "TCGA-READ"                
[71] "TCGA-SARC"                 "TCGA-SKCM"                
[73] "TCGA-STAD"                 "TCGA-TGCT"                
[75] "TCGA-THCA"                 "TCGA-THYM"                
[77] "TCGA-UCEC"                 "TCGA-UCS"                 
[79] "TCGA-UVM"                  "TRIO-CRU"                 
[81] "VAREPOP-APOLLO"            "WCDT-MCRPC"               

The results() method will fetch actual results.

my_proj <- projects() %>% results(size = 10)

str(my_proj, max.level = 1)
List of 9
 $ id                    : chr [1:10] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ primary_site          :List of 10
 $ dbgap_accession_number: chr [1:10] "phs000528" "phs000465" NA NA ...
 $ project_id            : chr [1:10] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ disease_type          :List of 10
 $ name                  : chr [1:10] "HIV+ Tumor Molecular Characterization Project - Cervical Cancer" "Acute Myeloid Leukemia" "AACR Project GENIE - Contributed by Johns Hopkins Sidney Kimmel Comprehensive Cancer Center" "AACR Project GENIE - Contributed by Memorial Sloan Kettering Cancer Center" ...
 $ releasable            : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ state                 : chr [1:10] "open" "open" "open" "open" ...
 $ released              : logi [1:10] TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
 - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"
my_proj$project_id
 [1] "CGCI-HTMCP-CC" "TARGET-AML"    "GENIE-JHU"     "GENIE-MSK"    
 [5] "GENIE-VICC"    "GENIE-MDA"     "TCGA-MESO"     "TARGET-ALL-P3"
 [9] "TCGA-UVM"      "TCGA-KICH"    

Clinical data

Accessing clinical data.

case_ids <- cases() %>% results(size=10) %>% ids()
clindat <- gdc_clinical(case_ids)
names(clindat)
[1] "demographic" "diagnoses"   "exposures"   "main"       

View the diagnoses table, dropping columns in which every value is NA.

idx <- apply(clindat$diagnoses, 2, function(x) all(is.na(x)))
DT::datatable(clindat$diagnoses[, !idx])
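
The column filter can be tried on a small table without querying the GDC. This is a minimal sketch in which toy_diagnoses stands in for clindat$diagnoses; its values are invented.

```r
# Toy table standing in for clindat$diagnoses (values are invented).
toy_diagnoses <- data.frame(
  case_id = c("case-1", "case-2", "case-3"),
  age_at_diagnosis = c(20000, NA, 25000),
  empty_field = c(NA, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for columns in which every value is NA.
all_na <- apply(toy_diagnoses, 2, function(x) all(is.na(x)))

# Keep the columns that hold at least one value.
toy_kept <- toy_diagnoses[, !all_na, drop = FALSE]
names(toy_kept)
```

A column with only some missing values, such as age_at_diagnosis here, is kept; only fully empty columns are dropped.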

Cases

Fetch a single case (a sample donor) as a starting point for finding its related files.

case1 <- cases() %>% results(size=1)
str(case1, max.level = 1)
List of 25
 $ id                      : chr "935ca1d3-2445-4f59-95a6-19f3311c1900"
 $ lost_to_followup        : chr "No"
 $ slide_ids               :List of 1
 $ submitter_slide_ids     :List of 1
 $ days_to_lost_to_followup: logi NA
 $ disease_type            : chr "Squamous Cell Neoplasms"
 $ analyte_ids             :List of 1
 $ submitter_id            : chr "HTMCP-03-06-02345"
 $ submitter_analyte_ids   :List of 1
 $ days_to_consent         : logi NA
 $ aliquot_ids             :List of 1
 $ submitter_aliquot_ids   :List of 1
 $ created_datetime        : chr "2019-11-21T18:06:42.617487-06:00"
 $ diagnosis_ids           :List of 1
 $ sample_ids              :List of 1
 $ consent_type            : logi NA
 $ submitter_sample_ids    :List of 1
 $ primary_site            : chr "Cervix uteri"
 $ submitter_diagnosis_ids :List of 1
 $ updated_datetime        : chr "2020-04-28T11:49:05.699379-05:00"
 $ case_id                 : chr "935ca1d3-2445-4f59-95a6-19f3311c1900"
 $ index_date              : chr "Diagnosis"
 $ state                   : chr "released"
 $ portion_ids             :List of 1
 $ submitter_portion_ids   :List of 1
 - attr(*, "row.names")= int 1
 - attr(*, "class")= chr [1:3] "GDCcasesResults" "GDCResults" "list"

Sample IDs.

case1$sample_ids
$`935ca1d3-2445-4f59-95a6-19f3311c1900`
[1] "f7706af8-c4e6-4e94-95f1-b6b4901dfe28"
[2] "bb3365f7-7bf9-46c6-ac60-4b7e77268ed8"
[3] "a35a4c87-86f9-4400-b43a-2b0999c69c19"

All case fields.

case_fields <- available_fields("cases")

Search case_fields for fields of interest with grep.

grep("sample_ids", case_fields, value = TRUE)
[1] "sample_ids"           "submitter_sample_ids"
grep("sample_type", case_fields, value = TRUE)
[1] "samples.sample_type"    "samples.sample_type_id"
grep("workflow_type", case_fields, value = TRUE)
[1] "files.analysis.metadata.read_groups.read_group_qcs.workflow_type"
[2] "files.analysis.workflow_type"                                    
[3] "files.downstream_analyses.workflow_type"                         

Get all cases that have open STAR count files: count the matching cases first, then request that many results.

n_star_cases <- cases() %>%
  filter(files.analysis.workflow_type == 'STAR - Counts') %>%
  filter(files.access == 'open') %>%
  count()

star_cases <- cases() %>%
  filter(files.analysis.workflow_type == 'STAR - Counts') %>%
  filter(files.access == 'open') %>%
  results(size = n_star_cases)

sapply(star_cases, length)
                      id         lost_to_followup                slide_ids 
                   19101                    19101                    19101 
     submitter_slide_ids days_to_lost_to_followup             disease_type 
                   19101                    19101                    19101 
             analyte_ids             submitter_id    submitter_analyte_ids 
                   19101                    19101                    19101 
         days_to_consent              aliquot_ids    submitter_aliquot_ids 
                   19101                    19101                    19101 
        created_datetime            diagnosis_ids               sample_ids 
                   19101                    19101                    19101 
            consent_type     submitter_sample_ids             primary_site 
                   19101                    19101                    19101 
 submitter_diagnosis_ids         updated_datetime                  case_id 
                   19101                    19101                    19101 
              index_date                    state              portion_ids 
                   19101                    19101                    19101 
   submitter_portion_ids 
                   19101 

case_id is the same as id.

table(star_cases$case_id == star_cases$id)

 TRUE 
19101 

One case ID can map to multiple sample IDs.

head(star_cases$sample_ids, 3)
$`9453db51-fff8-4a78-a29c-bb9151e9bd2a`
[1] "6662a85c-37b7-48b1-a8c6-f00171bb8226"
[2] "9bab246d-4a0d-4f28-ba1f-56b19a6f93bb"
[3] "6b8ea6bb-d10b-474a-9b4b-f406285dfb2f"

$`9485e946-f569-46fb-b77e-e5af68f7961a`
[1] "e3f781a2-f087-4abb-8f36-af799e837557"
[2] "cc8c2432-4107-4b5a-9452-3c536dac8baf"
[3] "330292a0-80dd-4fc4-a64c-4fce119dcbb6"

$`981300da-9136-402a-88df-2c76b1e3ad87`
[1] "42c67b29-94a1-4520-9122-b2daa02a03ad"
[2] "9d351761-59cb-40f7-aee2-ce2c6365acc2"
[3] "9276070c-cab5-4ba3-978d-2d18976a8758"

Build a lookup table from sample IDs to case IDs.

sample_id_len <- sapply(star_cases$sample_ids, length)
my_ids <- rep(names(sample_id_len), sample_id_len)
sample_id_lookup <- data.frame(
  sample_ids = unlist(star_cases$sample_ids),
  case_id = my_ids,
  row.names = NULL
)

head(sample_id_lookup)
                            sample_ids                              case_id
1 6662a85c-37b7-48b1-a8c6-f00171bb8226 9453db51-fff8-4a78-a29c-bb9151e9bd2a
2 9bab246d-4a0d-4f28-ba1f-56b19a6f93bb 9453db51-fff8-4a78-a29c-bb9151e9bd2a
3 6b8ea6bb-d10b-474a-9b4b-f406285dfb2f 9453db51-fff8-4a78-a29c-bb9151e9bd2a
4 e3f781a2-f087-4abb-8f36-af799e837557 9485e946-f569-46fb-b77e-e5af68f7961a
5 cc8c2432-4107-4b5a-9452-3c536dac8baf 9485e946-f569-46fb-b77e-e5af68f7961a
6 330292a0-80dd-4fc4-a64c-4fce119dcbb6 9485e946-f569-46fb-b77e-e5af68f7961a
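Once the lookup table exists, case IDs for arbitrary sample IDs can be retrieved with `match()`. The following is a minimal sketch, assuming a small named list with the same shape as `star_cases$sample_ids` (two cases copied from the output above); the names `toy_sample_ids`, `toy_lookup`, `query`, and `hits` are illustrative and not part of the original analysis.

```r
# Toy list with the same shape as star_cases$sample_ids:
# names are case IDs, values are character vectors of sample IDs.
toy_sample_ids <- list(
  "9453db51-fff8-4a78-a29c-bb9151e9bd2a" = c(
    "6662a85c-37b7-48b1-a8c6-f00171bb8226",
    "9bab246d-4a0d-4f28-ba1f-56b19a6f93bb",
    "6b8ea6bb-d10b-474a-9b4b-f406285dfb2f"
  ),
  "9485e946-f569-46fb-b77e-e5af68f7961a" = c(
    "e3f781a2-f087-4abb-8f36-af799e837557",
    "cc8c2432-4107-4b5a-9452-3c536dac8baf",
    "330292a0-80dd-4fc4-a64c-4fce119dcbb6"
  )
)

# Repeat each case ID once per sample so both columns line up.
toy_lookup <- data.frame(
  sample_ids = unlist(toy_sample_ids, use.names = FALSE),
  case_id = rep(names(toy_sample_ids), lengths(toy_sample_ids)),
  stringsAsFactors = FALSE
)

# Look up the case IDs for a set of sample IDs (NA if a sample is unknown).
query <- c(
  "330292a0-80dd-4fc4-a64c-4fce119dcbb6",
  "6662a85c-37b7-48b1-a8c6-f00171bb8226",
  "not-a-real-sample-id"
)
hits <- toy_lookup$case_id[match(query, toy_lookup$sample_ids)]
hits
```

`match()` returns the first matching row, so this assumes each sample ID belongs to exactly one case; `anyDuplicated(toy_lookup$sample_ids)` is a quick way to check that assumption on the real data.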

TCGA

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. This joint effort between NCI and the National Human Genome Research Institute began in 2006, bringing together researchers from diverse disciplines and multiple institutions.

TCGA nomenclature

Acronyms for TCGA cancer types:

  • ACC: adrenocortical
  • BRCA: breast
  • BLCA: bladder
  • COAD: colon
  • ESCA: esophageal
  • GBM: glioblastoma
  • HNSC: head and neck squamous cell
  • KICH: kidney chromophobe
  • KIRC: kidney clear cell
  • KIRP: kidney papillary
  • LGG: low grade glioma
  • LIHC: liver
  • LUAD: lung adenocarcinoma
  • PAAD: pancreatic
  • PRAD: prostate
  • STAD: stomach
  • THCA: thyroid
  • UCEC: endometrial

From https://www.bioconductor.org/packages/release/bioc/vignettes/TCGAbiolinks/inst/doc/query.html

A TCGA barcode is composed of a collection of identifiers. Each specifically identifies a TCGA data element. Refer to the following figure for an illustration of how metadata identifiers comprise a barcode. An aliquot barcode contains the highest number of identifiers. For example:

  • Aliquot barcode: TCGA-G4-6317-02A-11D-2064-05
  • Participant: TCGA-G4-6317
  • Sample: TCGA-G4-6317-02
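The way the participant and sample identifiers nest inside an aliquot barcode can be sketched in base R by splitting on the hyphens. This is a minimal sketch, assuming the standard seven-field aliquot layout (project, tissue source site, participant, sample type plus vial, portion plus analyte, plate, center); `parse_tcga_barcode` is a hypothetical helper, not a function from GenomicDataCommons or TCGAbiolinks.

```r
# Hypothetical helper: split a full TCGA aliquot barcode into its parts.
parse_tcga_barcode <- function(barcode) {
  parts <- strsplit(barcode, "-", fixed = TRUE)[[1]]
  # A full aliquot barcode is assumed to have exactly seven fields.
  stopifnot(length(parts) == 7)
  list(
    participant = paste(parts[1:3], collapse = "-"),
    sample      = paste(c(parts[1:3], substr(parts[4], 1, 2)), collapse = "-"),
    sample_type = substr(parts[4], 1, 2),  # two-digit sample type code
    vial        = substr(parts[4], 3, 3),  # vial letter
    portion     = substr(parts[5], 1, 2),  # two-digit portion number
    analyte     = substr(parts[5], 3, 3),  # analyte letter
    plate       = parts[6],
    center      = parts[7]
  )
}

parsed <- parse_tcga_barcode("TCGA-G4-6317-02A-11D-2064-05")
parsed$participant
parsed$sample
```

For this barcode, the participant and sample elements reproduce the two shorter identifiers given in the example above.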

Fetch projects.

my_projects <- projects() %>% results(size = 100)
str(my_projects, max.level = 1)
List of 9
 $ id                    : chr [1:82] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ primary_site          :List of 82
 $ dbgap_accession_number: chr [1:82] "phs000528" "phs000465" NA NA ...
 $ project_id            : chr [1:82] "CGCI-HTMCP-CC" "TARGET-AML" "GENIE-JHU" "GENIE-MSK" ...
 $ disease_type          :List of 82
 $ name                  : chr [1:82] "HIV+ Tumor Molecular Characterization Project - Cervical Cancer" "Acute Myeloid Leukemia" "AACR Project GENIE - Contributed by Johns Hopkins Sidney Kimmel Comprehensive Cancer Center" "AACR Project GENIE - Contributed by Memorial Sloan Kettering Cancer Center" ...
 $ releasable            : logi [1:82] TRUE TRUE TRUE TRUE TRUE TRUE ...
 $ state                 : chr [1:82] "open" "open" "open" "open" ...
 $ released              : logi [1:82] TRUE TRUE TRUE TRUE TRUE TRUE ...
 - attr(*, "row.names")= int [1:82] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "class")= chr [1:3] "GDCprojectsResults" "GDCResults" "list"

Project IDs.

my_projects$id
 [1] "CGCI-HTMCP-CC"             "TARGET-AML"               
 [3] "GENIE-JHU"                 "GENIE-MSK"                
 [5] "GENIE-VICC"                "GENIE-MDA"                
 [7] "TCGA-MESO"                 "TARGET-ALL-P3"            
 [9] "TCGA-UVM"                  "TCGA-KICH"                
[11] "TARGET-WT"                 "TARGET-OS"                
[13] "TCGA-DLBC"                 "GENIE-UHN"                
[15] "APOLLO-LUAD"               "CDDP_EAGLE-1"             
[17] "EXCEPTIONAL_RESPONDERS-ER" "MP2PRT-WT"                
[19] "CGCI-HTMCP-DLBCL"          "CMI-MPC"                  
[21] "WCDT-MCRPC"                "TCGA-CHOL"                
[23] "TCGA-UCS"                  "TCGA-PCPG"                
[25] "CPTAC-2"                   "TCGA-CESC"                
[27] "TCGA-LIHC"                 "TCGA-ACC"                 
[29] "CMI-MBC"                   "TCGA-BRCA"                
[31] "CPTAC-3"                   "TCGA-COAD"                
[33] "TCGA-GBM"                  "TCGA-TGCT"                
[35] "NCICCR-DLBCL"              "TCGA-LGG"                 
[37] "FM-AD"                     "GENIE-GRCC"               
[39] "CTSP-DLBCL1"               "TARGET-CCSK"              
[41] "GENIE-NKI"                 "TARGET-ALL-P1"            
[43] "MATCH-N"                   "TRIO-CRU"                 
[45] "CMI-ASC"                   "TARGET-RT"                
[47] "ORGANOID-PANCREATIC"       "MATCH-Z1D"                
[49] "MATCH-B"                   "VAREPOP-APOLLO"           
[51] "MATCH-Q"                   "BEATAML1.0-CRENOLANIB"    
[53] "MATCH-Y"                   "OHSU-CNL"                 
[55] "CGCI-HTMCP-LC"             "TARGET-NBL"               
[57] "TCGA-SARC"                 "TCGA-PAAD"                
[59] "TCGA-LUAD"                 "TCGA-PRAD"                
[61] "MP2PRT-ALL"                "TCGA-LUSC"                
[63] "TCGA-LAML"                 "TCGA-SKCM"                
[65] "HCMI-CMDC"                 "BEATAML1.0-COHORT"        
[67] "TCGA-BLCA"                 "TCGA-READ"                
[69] "TCGA-UCEC"                 "TCGA-THCA"                
[71] "TCGA-OV"                   "TCGA-KIRC"                
[73] "MMRF-COMMPASS"             "GENIE-DFCI"               
[75] "TCGA-HNSC"                 "TCGA-ESCA"                
[77] "CGCI-BLGSP"                "TARGET-ALL-P2"            
[79] "TCGA-STAD"                 "REBC-THYR"                
[81] "TCGA-KIRP"                 "TCGA-THYM"                
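Since the project list mixes several programs (TCGA, TARGET, GENIE, and others), the TCGA projects can be pulled out by their `TCGA-` prefix. The sketch below uses a small hand-copied subset of the IDs printed above so that it runs without querying the GDC; `project_ids`, `tcga_ids`, and `tcga_codes` are illustrative names, and on the real data `my_projects$id` would take the place of `project_ids`.

```r
# A few project IDs copied from the output above.
project_ids <- c(
  "CGCI-HTMCP-CC", "TARGET-AML", "GENIE-JHU",
  "TCGA-MESO", "TCGA-UVM", "TCGA-KICH"
)

# Keep only the TCGA projects.
tcga_ids <- grep("^TCGA-", project_ids, value = TRUE)

# Strip the prefix to recover the cancer type acronym.
tcga_codes <- sub("^TCGA-", "", tcga_ids)
tcga_codes
```

The resulting acronyms follow the same scheme as the TCGA nomenclature list earlier on this page.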

Tabulate treatment types for cases in the TCGA-OV project.

cases() %>%
  filter(project.project_id == 'TCGA-OV') %>% 
  facet("diagnoses.treatments.treatment_type") %>%
  aggregations()
$diagnoses.treatments.treatment_type
  doc_count                         key
1       587 pharmaceutical therapy, nos
2       587      radiation therapy, nos
3        21                    _missing

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] GenomicDataCommons_1.26.0 magrittr_2.0.3           
 [3] lubridate_1.9.3           forcats_1.0.0            
 [5] stringr_1.5.0             dplyr_1.1.3              
 [7] purrr_1.0.2               readr_2.1.4              
 [9] tidyr_1.3.0               tibble_3.2.1             
[11] ggplot2_3.4.4             tidyverse_2.0.0          
[13] workflowr_1.7.1          

loaded via a namespace (and not attached):
 [1] gtable_0.3.4            xfun_0.40               bslib_0.5.1            
 [4] htmlwidgets_1.6.2       processx_3.8.2          callr_3.7.3            
 [7] tzdb_0.4.0              crosstalk_1.2.0         vctrs_0.6.4            
[10] tools_4.3.2             ps_1.7.5                bitops_1.0-7           
[13] generics_0.1.3          curl_5.1.0              stats4_4.3.2           
[16] fansi_1.0.5             pkgconfig_2.0.3         S4Vectors_0.40.1       
[19] lifecycle_1.0.3         GenomeInfoDbData_1.2.11 compiler_4.3.2         
[22] git2r_0.32.0            munsell_0.5.0           getPass_0.2-2          
[25] httpuv_1.6.12           GenomeInfoDb_1.38.0     htmltools_0.5.6.1      
[28] sass_0.4.7              RCurl_1.98-1.12         yaml_2.3.7             
[31] crayon_1.5.2            later_1.3.1             pillar_1.9.0           
[34] jquerylib_0.1.4         whisker_0.4.1           ellipsis_0.3.2         
[37] DT_0.30                 cachem_1.0.8            tidyselect_1.2.0       
[40] digest_0.6.33           stringi_1.7.12          rprojroot_2.0.3        
[43] fastmap_1.1.1           grid_4.3.2              colorspace_2.1-0       
[46] cli_3.6.1               utf8_1.2.4              withr_2.5.1            
[49] rappdirs_0.3.3          scales_1.2.1            promises_1.2.1         
[52] timechange_0.2.0        XVector_0.42.0          rmarkdown_2.25         
[55] httr_1.4.7              hms_1.1.3               evaluate_0.22          
[58] knitr_1.44              GenomicRanges_1.54.1    IRanges_2.36.0         
[61] rlang_1.1.1             Rcpp_1.0.11             glue_1.6.2             
[64] xml2_1.3.5              BiocGenerics_0.48.0     rstudioapi_0.15.0      
[67] jsonlite_1.8.7          R6_2.5.1                zlibbioc_1.48.0        
[70] fs_1.6.3