Reading irregular data into R

Last updated: 2023-05-30

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20200712)

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 7432d2f

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 7432d2f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.1.2/
    Ignored:    r_packages_4.2.0/
    Ignored:    r_packages_4.2.2/
    Ignored:    r_packages_4.3.0/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  analysis/tss_xgboost.Rmd
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz
    Untracked:  data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa
    Untracked:  data/netmhciipan.out.gz
    Untracked:  women.json

Unstaged changes:
    Modified:   analysis/graph.Rmd

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the repository in which changes were made to the R Markdown (analysis/r_parsing.Rmd) and HTML (docs/r_parsing.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File	Version	Author	Date	Message
Rmd	7432d2f	Dave Tang	2023-05-30	Parsing irregular formats

Some bioinformatics tools output files that are visually nice and are meant for manual inspection. This practice of generating visually nice output (and/or) Excel files may be rooted in how bioinformaticians and biologists used to work with each other; give the bioinformatician/s some data and they will generate output that will be manually inspected and verified by the biologist/s. I say “used to” because nowadays a lot of biologists are much more savvy with data processing and analysis, and don’t need to do things manually.

One such tool is NetMHCIIpan, which generates output that requires a bit more work to parse in contrast to parsing a CSV or TSV file.

gunzip -c data/netmhciipan.out.gz | head -20

# NetMHCIIpan version 4.0

# Input is in FASTA format

# Peptide length 15

# Prediction Mode: EL

# Threshold for Strong binding peptides (%Rank) 2%
# Threshold for Weak binding peptides (%Rank)   10%

# Allele: DRB1_0101
--------------------------------------------------------------------------------------------------------------------------------------------
 Pos           MHC              Peptide   Of        Core  Core_Rel        Identity      Score_EL %Rank_EL Exp_Bind  BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
   1     DRB1_0101      MAEMKTDAATLAQEA    3   MKTDAATLA     0.980          P9WNK5      0.097960     9.99       NA   <=WB
   2     DRB1_0101      AEMKTDAATLAQEAG    2   MKTDAATLA     0.867          P9WNK5      0.041685    16.05       NA       
   3     DRB1_0101      EMKTDAATLAQEAGN    4   DAATLAQEA     0.613          P9WNK5      0.012857    28.51       NA       
   4     DRB1_0101      MKTDAATLAQEAGNF    3   DAATLAQEA     0.813          P9WNK5      0.009156    33.36       NA       
   5     DRB1_0101      KTDAATLAQEAGNFE    2   DAATLAQEA     0.627          P9WNK5      0.004353    45.83       NA

We will use the readr package, a very useful package for loading data into R.

library(readr)

The read_lines() function reads up to n_max lines from a file and stores each line as a character vector.

Files ending in .gz, .bz2, .xz or .zip will be automatically uncompressed.
Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded.
n_max = -1L will read all lines.

read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  str()

 chr [1:104] "# NetMHCIIpan version 4.0" "" "# Input is in FASTA format" "" ...

We will define a skip_line function that can be used to skip lines that match a regular expression (regex). The regex below will match (and skip) lines that:

start with a - (^-) or (|)
start with Number (^Number) or (|)
start with a # (^#) or (|)
are blank, i.e. does not contain any characters (^$) or (|)
start with a space and Pos (^\\sPos).

skip_line <- function(x, regex = "^#"){
  wanted <- !grepl(pattern = regex, x = x, perl = TRUE)
  return(x[wanted])
}

read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
  head()

[1] "   1     DRB1_0101      MAEMKTDAATLAQEA    3   MKTDAATLA     0.980          P9WNK5      0.097960     9.99       NA   <=WB"
[2] "   2     DRB1_0101      AEMKTDAATLAQEAG    2   MKTDAATLA     0.867          P9WNK5      0.041685    16.05       NA       "
[3] "   3     DRB1_0101      EMKTDAATLAQEAGN    4   DAATLAQEA     0.613          P9WNK5      0.012857    28.51       NA       "
[4] "   4     DRB1_0101      MKTDAATLAQEAGNF    3   DAATLAQEA     0.813          P9WNK5      0.009156    33.36       NA       "
[5] "   5     DRB1_0101      KTDAATLAQEAGNFE    2   DAATLAQEA     0.627          P9WNK5      0.004353    45.83       NA       "
[6] "   6     DRB1_0101      TDAATLAQEAGNFER    5   LAQEAGNFE     0.640          P9WNK5      0.007844    35.70       NA       "

We use the native R pipe (|>) to pipe the character vector to the skip_line function, which will then skip entries matching our regex. Note the use of the _ placeholder for the input vector; this is not necessary since the vector will be automatically forwarded to the skip_line function when using |>.

Finally, we will use the read_table function to output a tibble from the lines that we want, i.e. were not skipped. This function is very useful for this type of data where each column is separated by an irregular number of spaces (one or more).

We will define the column names and the column types that are stored as a vector and list, respectively. This is good practice because this ensures that your column contains the same data type.

my_colnames <- c(
  "pos",
  "mhc",
  "peptide",
  "offset",
  "core",
  "core_rel",
  "id",
  "score_el",
  "perc_rank_el",
  "exp_bind",
  "bind_level"
)

my_coltypes <- list(
  pos = "i",
  mhc = "c",
  peptide = "c",
  offset = "i",
  core = "c",
  core_rel = "d",
  id = "c",
  score_el = "d",
  perc_rank_el = "d",
  exp_bind = "d",
  bind_level = "c"
)

read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
  read_table(col_names = my_colnames, col_types = my_coltypes) |>
  tail()

# A tibble: 6 × 11
    pos mhc   peptide offset core  core_rel id    score_el perc_rank_el exp_bind
  <int> <chr> <chr>    <int> <chr>    <dbl> <chr>    <dbl>        <dbl>    <dbl>
1    81 DRB1… VQYSRA…      2 YSRA…    0.9   P9WN…  0.00551         41.5       NA
2    82 DRB1… QYSRAD…      1 YSRA…    0.473 P9WN…  0.00115         73.9       NA
3    83 DRB1… YSRADE…      5 EEQQ…    0.733 P9WN…  0.00340         50.8       NA
4    84 DRB1… SRADEE…      4 EEQQ…    0.84  P9WN…  0.00473         44.3       NA
5    85 DRB1… RADEEQ…      3 EEQQ…    0.667 P9WN…  0.00806         35.3       NA
6    86 DRB1… ADEEQQ…      5 QQAL…    0.593 P9WN…  0.00642         39.0       NA
# ℹ 1 more variable: bind_level <chr>

sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readr_2.1.4     workflowr_1.7.0

loaded via a namespace (and not attached):
 [1] bit_4.0.5        jsonlite_1.8.4   crayon_1.5.2     compiler_4.3.0  
 [5] promises_1.2.0.1 tidyselect_1.2.0 Rcpp_1.0.10      stringr_1.5.0   
 [9] git2r_0.32.0     parallel_4.3.0   callr_3.7.3      later_1.3.0     
[13] jquerylib_0.1.4  yaml_2.3.7       fastmap_1.1.1    R6_2.5.1        
[17] knitr_1.42       tibble_3.2.1     rprojroot_2.0.3  tzdb_0.3.0      
[21] bslib_0.4.2      pillar_1.9.0     rlang_1.1.0      utf8_1.2.3      
[25] cachem_1.0.7     stringi_1.7.12   httpuv_1.6.9     xfun_0.39       
[29] getPass_0.2-2    fs_1.6.2         sass_0.4.5       bit64_4.0.5     
[33] cli_3.6.1        magrittr_2.0.3   ps_1.7.5         digest_0.6.31   
[37] processx_3.8.1   vroom_1.6.1      rstudioapi_0.14  hms_1.1.3       
[41] lifecycle_1.0.3  vctrs_0.6.2      evaluate_0.20    glue_1.6.2      
[45] whisker_0.4.1    fansi_1.0.4      rmarkdown_2.21   httr_1.4.5      
[49] tools_4.3.0      pkgconfig_2.0.3  htmltools_0.5.5