Interfacing with databases in R

Last updated: 2024-07-26

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.1). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: e897914

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version e897914. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.3.3/
    Ignored:    r_packages_4.4.0/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/dbi.Rmd) and HTML (docs/dbi.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	e897914	Dave Tang	2024-07-26	dbplyr basics
html	47f20cf	Dave Tang	2024-07-26	Build site.
Rmd	a9be34e	Dave Tang	2024-07-26	Connect to SQLite and make some basic queries
html	6917788	Dave Tang	2024-07-17	Build site.
Rmd	bb8ad42	Dave Tang	2024-07-17	Database basics

Introduction

The {DBI} package provides:

A database interface definition for communication between R and relational database management systems. All classes in this package are virtual and need to be extended by the various R/DBMS implementations.

A database table can be thought of as a data frame, however there are three high-level differences between them:

Database tables are stored on disk and can be arbitrarily large, whereas data frames are stored in memory and are fundamentally limited.
Database tables almost always have indexes making it possible to quickly find rows of interest without having to look at every single row. Data frames don’t have indexes but data tables do, which is one of the reasons why they’re so fast.
Most classical databases are optimised for rapidly collecting data and not for analysing existing data. These databases are called row-oriented because the data is stored row by row, rather than column by column like data frames. More recently, there’s been much development of column-oriented databases that make analysing existing data much faster.

Databases are run by database management systems (DBMS), which come in three basic forms:

Client-server DBMS run on a powerful central server, which you connect from your computer (the client). They are useful for sharing data with multiple people and popular client-server DBMS include PostgreSQL, MariaDB, SQL Server, and Oracle.
Cloud DBMS, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client-server DBMS, but they run in the cloud, taking advantage of cloud capabilities.
In-process DBMS, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you are the primary user.

Connecting to a database

To connect to a database in R, we need:

The DBI package because it provides a set of generic functions that connect to the database.
A specific package tailored for the DBMS of interest; this package translates the generic DBI commands into the specifics.

The {RSQLite} package provides a SQLite interface for R.

Embeds the SQLite database engine in R and provides an interface compliant with the DBI package. The source for the SQLite engine and for various extensions in a recent version is included. System libraries will never be consulted because this package relies on static linking for the plugins it includes; this also ensures a consistent experience across all installations.

SQLite database downloaded as per the post Interfacing with the Sequence Read Archive in R.

dbfile <- "/data2/sradb/SRAmetadb.sqlite"
mydb <- dbConnect(RSQLite::SQLite(), dbname = dbfile)
mydb

<SQLiteConnection>
  Path: /data2/sradb/SRAmetadb.sqlite
  Extensions: TRUE

DBI basics

Lists all tables in the database.

tabs <- dbListTables(mydb)
tabs

 [1] "col_desc"        "experiment"      "metaInfo"        "run"            
 [5] "sample"          "sra"             "sra_ft"          "sra_ft_content" 
 [9] "sra_ft_segdir"   "sra_ft_segments" "study"           "submission"

Neat trick to get the fields of a table.

get_fields <- "SELECT * FROM run WHERE 1=0"
DBI::dbGetQuery(mydb, get_fields)

 [1] run_ID               bamFile              run_alias           
 [4] run_accession        broker_name          instrument_name     
 [7] run_date             run_file             run_center          
[10] total_data_blocks    experiment_accession experiment_name     
[13] sra_link             run_url_link         xref_link           
[16] run_entrez_link      ddbj_link            ena_link            
[19] run_attribute        submission_accession sradb_updated       
<0 rows> (or 0-length row.names)

Save all table fields.

table_fields <- purrr::map(tabs, \(x){
  sql <- paste0('SELECT * FROM ', x, ' WHERE 1=0')
  DBI::dbGetQuery(mydb, sql)
})

names(table_fields) <- tabs
table_fields[[1]]

[1] col_desc_ID   table_name    field_name    type          description  
[6] value_list    sradb_updated
<0 rows> (or 0-length row.names)

table_fields

$col_desc
[1] col_desc_ID   table_name    field_name    type          description  
[6] value_list    sradb_updated
<0 rows> (or 0-length row.names)

$experiment
 [1] experiment_ID                 bamFile                      
 [3] fastqFTP                      experiment_alias             
 [5] experiment_accession          broker_name                  
 [7] center_name                   title                        
 [9] study_name                    study_accession              
[11] design_description            sample_name                  
[13] sample_accession              sample_member                
[15] library_name                  library_strategy             
[17] library_source                library_selection            
[19] library_layout                targeted_loci                
[21] library_construction_protocol spot_length                  
[23] adapter_spec                  read_spec                    
[25] platform                      instrument_model             
[27] platform_parameters           sequence_space               
[29] base_caller                   quality_scorer               
[31] number_of_levels              multiplier                   
[33] qtype                         sra_link                     
[35] experiment_url_link           xref_link                    
[37] experiment_entrez_link        ddbj_link                    
[39] ena_link                      experiment_attribute         
[41] submission_accession          sradb_updated                
<0 rows> (or 0-length row.names)

$metaInfo
[1] name  value
<0 rows> (or 0-length row.names)

$run
 [1] run_ID               bamFile              run_alias           
 [4] run_accession        broker_name          instrument_name     
 [7] run_date             run_file             run_center          
[10] total_data_blocks    experiment_accession experiment_name     
[13] sra_link             run_url_link         xref_link           
[16] run_entrez_link      ddbj_link            ena_link            
[19] run_attribute        submission_accession sradb_updated       
<0 rows> (or 0-length row.names)

$sample
 [1] sample_ID            sample_alias         sample_accession    
 [4] broker_name          center_name          taxon_id            
 [7] scientific_name      common_name          anonymized_name     
[10] individual_name      description          sra_link            
[13] sample_url_link      xref_link            sample_entrez_link  
[16] ddbj_link            ena_link             sample_attribute    
[19] submission_accession sradb_updated       
<0 rows> (or 0-length row.names)

$sra
 [1] sra_ID                        SRR_bamFile                  
 [3] SRX_bamFile                   SRX_fastqFTP                 
 [5] run_ID                        run_alias                    
 [7] run_accession                 run_date                     
 [9] updated_date                  spots                        
[11] bases                         run_center                   
[13] experiment_name               run_url_link                 
[15] run_entrez_link               run_attribute                
[17] experiment_ID                 experiment_alias             
[19] experiment_accession          experiment_title             
[21] study_name                    sample_name                  
[23] design_description            library_name                 
[25] library_strategy              library_source               
[27] library_selection             library_layout               
[29] library_construction_protocol adapter_spec                 
[31] read_spec                     platform                     
[33] instrument_model              instrument_name              
[35] platform_parameters           sequence_space               
[37] base_caller                   quality_scorer               
[39] number_of_levels              multiplier                   
[41] qtype                         experiment_url_link          
[43] experiment_entrez_link        experiment_attribute         
[45] sample_ID                     sample_alias                 
[47] sample_accession              taxon_id                     
[49] common_name                   anonymized_name              
[51] individual_name               description                  
[53] sample_url_link               sample_entrez_link           
[55] sample_attribute              study_ID                     
[57] study_alias                   study_accession              
[59] study_title                   study_type                   
[61] study_abstract                center_project_name          
[63] study_description             study_url_link               
[65] study_entrez_link             study_attribute              
[67] related_studies               primary_study                
[69] submission_ID                 submission_accession         
[71] submission_comment            submission_center            
[73] submission_lab                submission_date              
[75] sradb_updated                
<0 rows> (or 0-length row.names)

$sra_ft
 [1] SRR_bamFile                   SRX_bamFile                  
 [3] SRX_fastqFTP                  run_ID                       
 [5] run_alias                     run_accession                
 [7] run_date                      updated_date                 
 [9] spots                         bases                        
[11] run_center                    experiment_name              
[13] run_url_link                  run_entrez_link              
[15] run_attribute                 experiment_ID                
[17] experiment_alias              experiment_accession         
[19] experiment_title              study_name                   
[21] sample_name                   design_description           
[23] library_name                  library_strategy             
[25] library_source                library_selection            
[27] library_layout                library_construction_protocol
[29] adapter_spec                  read_spec                    
[31] platform                      instrument_model             
[33] instrument_name               platform_parameters          
[35] sequence_space                base_caller                  
[37] quality_scorer                number_of_levels             
[39] multiplier                    qtype                        
[41] experiment_url_link           experiment_entrez_link       
[43] experiment_attribute          sample_ID                    
[45] sample_alias                  sample_accession             
[47] taxon_id                      common_name                  
[49] anonymized_name               individual_name              
[51] description                   sample_url_link              
[53] sample_entrez_link            sample_attribute             
[55] study_ID                      study_alias                  
[57] study_accession               study_title                  
[59] study_type                    study_abstract               
[61] center_project_name           study_description            
[63] study_url_link                study_entrez_link            
[65] study_attribute               related_studies              
[67] primary_study                 submission_ID                
[69] submission_accession          submission_comment           
[71] submission_center             submission_lab               
[73] submission_date               sradb_updated                
<0 rows> (or 0-length row.names)

$sra_ft_content
 [1] docid                            c0SRR_bamFile                   
 [3] c1SRX_bamFile                    c2SRX_fastqFTP                  
 [5] c3run_ID                         c4run_alias                     
 [7] c5run_accession                  c6run_date                      
 [9] c7updated_date                   c8spots                         
[11] c9bases                          c10run_center                   
[13] c11experiment_name               c12run_url_link                 
[15] c13run_entrez_link               c14run_attribute                
[17] c15experiment_ID                 c16experiment_alias             
[19] c17experiment_accession          c18experiment_title             
[21] c19study_name                    c20sample_name                  
[23] c21design_description            c22library_name                 
[25] c23library_strategy              c24library_source               
[27] c25library_selection             c26library_layout               
[29] c27library_construction_protocol c28adapter_spec                 
[31] c29read_spec                     c30platform                     
[33] c31instrument_model              c32instrument_name              
[35] c33platform_parameters           c34sequence_space               
[37] c35base_caller                   c36quality_scorer               
[39] c37number_of_levels              c38multiplier                   
[41] c39qtype                         c40experiment_url_link          
[43] c41experiment_entrez_link        c42experiment_attribute         
[45] c43sample_ID                     c44sample_alias                 
[47] c45sample_accession              c46taxon_id                     
[49] c47common_name                   c48anonymized_name              
[51] c49individual_name               c50description                  
[53] c51sample_url_link               c52sample_entrez_link           
[55] c53sample_attribute              c54study_ID                     
[57] c55study_alias                   c56study_accession              
[59] c57study_title                   c58study_type                   
[61] c59study_abstract                c60center_project_name          
[63] c61study_description             c62study_url_link               
[65] c63study_entrez_link             c64study_attribute              
[67] c65related_studies               c66primary_study                
[69] c67submission_ID                 c68submission_accession         
[71] c69submission_comment            c70submission_center            
[73] c71submission_lab                c72submission_date              
[75] c73sradb_updated                
<0 rows> (or 0-length row.names)

$sra_ft_segdir
[1] level            idx              start_block      leaves_end_block
[5] end_block        root            
<0 rows> (or 0-length row.names)

$sra_ft_segments
[1] blockid block  
<0 rows> (or 0-length row.names)

$study
 [1] study_ID             study_alias          study_accession     
 [4] study_title          study_type           study_abstract      
 [7] broker_name          center_name          center_project_name 
[10] study_description    related_studies      primary_study       
[13] sra_link             study_url_link       xref_link           
[16] study_entrez_link    ddbj_link            ena_link            
[19] study_attribute      submission_accession sradb_updated       
<0 rows> (or 0-length row.names)

$submission
 [1] submission_ID          submission_alias       submission_accession  
 [4] submission_comment     files                  broker_name           
 [7] center_name            lab_name               submission_date       
[10] sra_link               submission_url_link    xref_link             
[13] submission_entrez_link ddbj_link              ena_link              
[16] submission_attribute   sradb_updated         
<0 rows> (or 0-length row.names)

dbplyr basics

Use tbl() to create an object that represents a database table.

run_db <- tbl(src = mydb, 'run')
run_db

# Source:   table<`run`> [?? x 21]
# Database: sqlite 3.46.0 [/data2/sradb/SRAmetadb.sqlite]
   run_ID bamFile run_alias   run_accession broker_name instrument_name run_date
    <dbl> <chr>   <chr>       <chr>         <chr>       <chr>           <chr>   
 1      1 <NA>    2008-09-12… DRR000001     <NA>        <NA>            2008-09…
 2      2 <NA>    2008-09-12… DRR000002     <NA>        <NA>            2008-09…
 3      3 <NA>    DRR000004   DRR000004     <NA>        <NA>            2009-03…
 4      4 <NA>    DRR000005   DRR000005     <NA>        <NA>            2009-03…
 5      5 <NA>    DRR000006   DRR000006     <NA>        <NA>            2009-03…
 6      6 <NA>    DRR000007   DRR000007     <NA>        <NA>            2009-02…
 7      7 <NA>    DRR000363   DRR000363     <NA>        <NA>            2010-02…
 8      8 <NA>    DRR000364   DRR000364     <NA>        <NA>            2010-02…
 9      9 <NA>    DRR000365   DRR000365     <NA>        <NA>            2010-02…
10     10 <NA>    DRR000366   DRR000366     <NA>        <NA>            2010-02…
# ℹ more rows
# ℹ 14 more variables: run_file <chr>, run_center <chr>,
#   total_data_blocks <int>, experiment_accession <chr>, experiment_name <chr>,
#   sra_link <chr>, run_url_link <chr>, xref_link <chr>, run_entrez_link <chr>,
#   ddbj_link <chr>, ena_link <chr>, run_attribute <chr>,
#   submission_accession <chr>, sradb_updated <chr>

Select specific fields.

run_db |>
  select(run_accession, experiment_name)

# Source:   SQL [?? x 2]
# Database: sqlite 3.46.0 [/data2/sradb/SRAmetadb.sqlite]
   run_accession experiment_name
   <chr>         <chr>          
 1 DRR000001     DRX000001      
 2 DRR000002     DRX000002      
 3 DRR000004     DRX000003      
 4 DRR000005     DRX000003      
 5 DRR000006     DRX000003      
 6 DRR000007     DRX000003      
 7 DRR000363     DRX000003      
 8 DRR000364     DRX000003      
 9 DRR000365     DRX000003      
10 DRR000366     DRX000003      
# ℹ more rows

SRA table.

sra_db <- tbl(src = mydb, 'sra')
sra_db

# Source:   table<`sra`> [?? x 75]
# Database: sqlite 3.46.0 [/data2/sradb/SRAmetadb.sqlite]
   sra_ID SRR_bamFile SRX_bamFile SRX_fastqFTP run_ID run_alias    run_accession
    <dbl> <chr>       <chr>       <chr>         <dbl> <chr>        <chr>        
 1      1 <NA>        <NA>        <NA>              1 2008-09-12.… DRR000001    
 2      2 <NA>        <NA>        <NA>              2 2008-09-12.… DRR000002    
 3      3 <NA>        <NA>        <NA>             11 DRR000003    DRR000003    
 4      4 <NA>        <NA>        <NA>              5 DRR000006    DRR000006    
 5      5 <NA>        <NA>        <NA>              3 DRR000004    DRR000004    
 6      6 <NA>        <NA>        <NA>              7 DRR000363    DRR000363    
 7      7 <NA>        <NA>        <NA>              6 DRR000007    DRR000007    
 8      8 <NA>        <NA>        <NA>             10 DRR000366    DRR000366    
 9      9 <NA>        <NA>        <NA>              4 DRR000005    DRR000005    
10     10 <NA>        <NA>        <NA>              8 DRR000364    DRR000364    
# ℹ more rows
# ℹ 68 more variables: run_date <chr>, updated_date <chr>, spots <dbl>,
#   bases <dbl>, run_center <chr>, experiment_name <chr>, run_url_link <chr>,
#   run_entrez_link <chr>, run_attribute <chr>, experiment_ID <dbl>,
#   experiment_alias <chr>, experiment_accession <chr>, experiment_title <chr>,
#   study_name <chr>, sample_name <chr>, design_description <chr>,
#   library_name <chr>, library_strategy <chr>, library_source <chr>, …

End

Disconnect.

DBI::dbDisconnect(mydb)

sessionInfo()

R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
 [5] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0 dbplyr_2.5.0    RSQLite_2.3.7  
[13] DBI_1.2.3       workflowr_1.7.1

loaded via a namespace (and not attached):
 [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    stringi_1.8.4    
 [5] hms_1.1.3         digest_0.6.35     magrittr_2.0.3    timechange_0.3.0 
 [9] evaluate_0.24.0   grid_4.4.0        fastmap_1.2.0     blob_1.2.4       
[13] rprojroot_2.0.4   jsonlite_1.8.8    processx_3.8.4    whisker_0.4.1    
[17] ps_1.7.6          promises_1.3.0    httr_1.4.7        fansi_1.0.6      
[21] scales_1.3.0      jquerylib_0.1.4   cli_3.6.2         rlang_1.1.4      
[25] munsell_0.5.1     bit64_4.0.5       withr_3.0.0       cachem_1.1.0     
[29] yaml_2.3.8        tools_4.4.0       tzdb_0.4.0        memoise_2.0.1    
[33] colorspace_2.1-0  httpuv_1.6.15     vctrs_0.6.5       R6_2.5.1         
[37] lifecycle_1.0.4   git2r_0.33.0      fs_1.6.4          bit_4.0.5        
[41] pkgconfig_2.0.3   callr_3.7.6       gtable_0.3.5      pillar_1.9.0     
[45] bslib_0.7.0       later_1.3.2       glue_1.7.0        Rcpp_1.0.12      
[49] xfun_0.44         tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.47       
[53] htmltools_0.5.8.1 rmarkdown_2.27    compiler_4.4.0    getPass_0.2-4