Last updated: 2023-06-02
Checks: 7 passed, 0 failed
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 140c2a0. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: r_packages_4.1.2/
Ignored: r_packages_4.2.0/
Ignored: r_packages_4.2.2/
Ignored: r_packages_4.3.0/
Untracked files:
Untracked: analysis/cell_ranger.Rmd
Untracked: analysis/tss_xgboost.Rmd
Untracked: code/multiz100way/
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
Untracked: data/ncrna_NONCODE[v3.0].fasta.tar.gz
Untracked: data/ncrna_noncode_v3.fa
Untracked: data/netmhciipan.out.gz
Untracked: women.json
Unstaged changes:
Modified: analysis/graph.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/r_parsing.Rmd) and HTML (docs/r_parsing.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 140c2a0 | Dave Tang | 2023-06-02 | Update post |
html | 27ea248 | Dave Tang | 2023-05-30 | Build site. |
Rmd | 7432d2f | Dave Tang | 2023-05-30 | Parsing irregular formats |
Some bioinformatics tools output files that are visually nice and are meant for manual inspection. This practice of generating visually nice output (and/or Excel files) may be rooted in how bioinformaticians and biologists used to work with each other: give the bioinformaticians some data to analyse and they will generate output that is manually inspected and verified by the biologists. I say “used to” because nowadays a lot of biologists are much more savvy with data processing and analysis, and can either analyse their own data or no longer need to do everything manually.
One such tool is NetMHCIIpan, which generates output that requires a bit more work to parse than a CSV or TSV file. Here’s how the output looks:
gunzip -c data/netmhciipan.out.gz | head -20
# NetMHCIIpan version 4.0
# Input is in FASTA format
# Peptide length 15
# Prediction Mode: EL
# Threshold for Strong binding peptides (%Rank) 2%
# Threshold for Weak binding peptides (%Rank) 10%
# Allele: DRB1_0101
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 DRB1_0101 MAEMKTDAATLAQEA 3 MKTDAATLA 0.980 P9WNK5 0.097960 9.99 NA <=WB
2 DRB1_0101 AEMKTDAATLAQEAG 2 MKTDAATLA 0.867 P9WNK5 0.041685 16.05 NA
3 DRB1_0101 EMKTDAATLAQEAGN 4 DAATLAQEA 0.613 P9WNK5 0.012857 28.51 NA
4 DRB1_0101 MKTDAATLAQEAGNF 3 DAATLAQEA 0.813 P9WNK5 0.009156 33.36 NA
5 DRB1_0101 KTDAATLAQEAGNFE 2 DAATLAQEA 0.627 P9WNK5 0.004353 45.83 NA
The base R function read.table() can skip a given number of lines at the start of a file, which would have solved the problem (after figuring out how many lines to skip), but the end of the file also contains some free text and lines starting with hyphens (-).
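For illustration, a skip-based call might look something like the sketch below (not run here); n_header is a hypothetical count of header lines that would have to be determined by inspecting the file.

# minimal sketch of the skip-based approach; n_header is a placeholder value
n_header <- 12
read.table(gzfile("data/netmhciipan.out.gz"), skip = n_header, header = FALSE)
# the dashed separator and "Number of strong binders ..." lines at the end of
# the file would still be read (or trigger errors), so skipping alone is not enough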
gunzip -c data/netmhciipan.out.gz | tail
80 DRB1_0101 GVQYSRADEEQQQAL 3 YSRADEEQQ 0.940 P9WNK5 0.033621 17.96 NA
81 DRB1_0101 VQYSRADEEQQQALS 2 YSRADEEQQ 0.900 P9WNK5 0.005510 41.52 NA
82 DRB1_0101 QYSRADEEQQQALSS 1 YSRADEEQQ 0.473 P9WNK5 0.001152 73.90 NA
83 DRB1_0101 YSRADEEQQQALSSQ 5 EEQQQALSS 0.733 P9WNK5 0.003404 50.77 NA
84 DRB1_0101 SRADEEQQQALSSQM 4 EEQQQALSS 0.840 P9WNK5 0.004729 44.32 NA
85 DRB1_0101 RADEEQQQALSSQMG 3 EEQQQALSS 0.667 P9WNK5 0.008057 35.30 NA
86 DRB1_0101 ADEEQQQALSSQMGF 5 QQALSSQMG 0.593 P9WNK5 0.006416 38.95 NA
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 3 Number of weak binders: 5
--------------------------------------------------------------------------------------------------------------------------------------------
The approach I opted for when parsing this output was to create a function that skips lines matching a specified pattern, i.e. a regular expression or regex. We will use the readr package, a very useful package for loading data into R.
library(readr)
The read_lines() function will be used first to read the input. This function reads up to n_max lines from a file and stores the lines in a character vector. Notably:

- Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
- Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded.

Specifying n_max = -1L will read all lines of a file; read_lines() is typically used to scan only a small portion of a file to figure out its content, but here we will use it to read every line.
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
str()
chr [1:104] "# NetMHCIIpan version 4.0" "" "# Input is in FASTA format" "" ...
We will define a skip_line function that can be used to skip lines that match a regex. This is typical of how I use Perl or Python to read and store files.

The regex below will match (and therefore skip) lines that:

- start with a hyphen (^-), or
- start with Number (^Number), or
- start with # (^#), or
- are empty (^$), or
- start with whitespace followed by Pos (^\\sPos).

These alternatives are combined with the regex alternation symbol (|).

skip_line <- function(x, regex = "^#"){
wanted <- !grepl(pattern = regex, x = x, perl = TRUE)
return(x[wanted])
}
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
head()
[1] " 1 DRB1_0101 MAEMKTDAATLAQEA 3 MKTDAATLA 0.980 P9WNK5 0.097960 9.99 NA <=WB"
[2] " 2 DRB1_0101 AEMKTDAATLAQEAG 2 MKTDAATLA 0.867 P9WNK5 0.041685 16.05 NA "
[3] " 3 DRB1_0101 EMKTDAATLAQEAGN 4 DAATLAQEA 0.613 P9WNK5 0.012857 28.51 NA "
[4] " 4 DRB1_0101 MKTDAATLAQEAGNF 3 DAATLAQEA 0.813 P9WNK5 0.009156 33.36 NA "
[5] " 5 DRB1_0101 KTDAATLAQEAGNFE 2 DAATLAQEA 0.627 P9WNK5 0.004353 45.83 NA "
[6] " 6 DRB1_0101 TDAATLAQEAGNFER 5 LAQEAGNFE 0.640 P9WNK5 0.007844 35.70 NA "
We use the native R pipe (|>) to pipe the character vector to the skip_line function, which then skips entries matching our regex; this requires R 4.1.0 or above. Note the use of the _ placeholder for the input vector (this requires R 4.2.0 or above); this is not strictly necessary, since the vector is automatically forwarded to the first argument of skip_line when using |>, but I have included it here to show how you can refer to the piped input.
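As a toy illustration (not part of the analysis above), the two forms below are equivalent:

x <- c("# a comment", "", "keep me")
# explicit placeholder (R 4.2.0 or above)
x |> skip_line(x = _, regex = "^#|^$")
# implicit forwarding to the first argument (R 4.1.0 or above)
x |> skip_line(regex = "^#|^$")
# both return "keep me"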
Finally, we will use the read_table function (which is similar to the base read.table function) to output a tibble from the lines that we want, i.e. those that were not skipped. This function is very useful for this type of data, where each column is separated by an irregular number of spaces (one or more).

We will define the column names and the column types, which are stored as a vector and a list, respectively. This is good practice because it ensures that each column contains a single data type. In addition, by explicitly defining the columns, read_table can assign NAs when data is not available.
my_colnames <- c(
"pos",
"mhc",
"peptide",
"offset",
"core",
"core_rel",
"id",
"score_el",
"perc_rank_el",
"exp_bind",
"bind_level"
)
my_coltypes <- list(
pos = "i",
mhc = "c",
peptide = "c",
offset = "i",
core = "c",
core_rel = "d",
id = "c",
score_el = "d",
perc_rank_el = "d",
exp_bind = "d",
bind_level = "c"
)
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
read_table(col_names = my_colnames, col_types = my_coltypes) |>
tail()
# A tibble: 6 × 11
pos mhc peptide offset core core_rel id score_el perc_rank_el exp_bind
<int> <chr> <chr> <int> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 81 DRB1… VQYSRA… 2 YSRA… 0.9 P9WN… 0.00551 41.5 NA
2 82 DRB1… QYSRAD… 1 YSRA… 0.473 P9WN… 0.00115 73.9 NA
3 83 DRB1… YSRADE… 5 EEQQ… 0.733 P9WN… 0.00340 50.8 NA
4 84 DRB1… SRADEE… 4 EEQQ… 0.84 P9WN… 0.00473 44.3 NA
5 85 DRB1… RADEEQ… 3 EEQQ… 0.667 P9WN… 0.00806 35.3 NA
6 86 DRB1… ADEEQQ… 5 QQAL… 0.593 P9WN… 0.00642 39.0 NA
# ℹ 1 more variable: bind_level <chr>
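As a quick sanity check (a small illustration using the objects defined above), the parsed bind_level column can be tabulated and compared with the summary at the end of the raw output:

parsed <- read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  skip_line(regex = "^-|^Number|^#|^$|^\\sPos") |>
  read_table(col_names = my_colnames, col_types = my_coltypes)
table(parsed$bind_level, useNA = "ifany")
# the counts should match "Number of strong binders: 3  Number of weak binders: 5"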
My initial approach was to use read_lines to figure out how many lines to skip when using read_table, but the text at the end of the output resulted in parsing errors. That was when I decided that a better approach would be to explicitly state which lines to skip, no matter where they occur in the output.
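For reference, that initial approach looked something like the following sketch (hypothetical code, shown only to illustrate why it was abandoned):

all_lines <- read_lines("data/netmhciipan.out.gz", n_max = -1L)
# find the first data line: leading whitespace followed by a digit
first_data <- min(grep("^\\s+\\d", all_lines))
read_table(
  "data/netmhciipan.out.gz",
  skip = first_data - 1,
  col_names = my_colnames,
  col_types = my_coltypes
)
# the dashed separator and free-text summary lines at the end of the file
# still end up in the input, which is what caused the parsing errors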
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_2.1.4 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] bit_4.0.5 jsonlite_1.8.4 crayon_1.5.2 compiler_4.3.0
[5] promises_1.2.0.1 tidyselect_1.2.0 Rcpp_1.0.10 stringr_1.5.0
[9] git2r_0.32.0 parallel_4.3.0 callr_3.7.3 later_1.3.0
[13] jquerylib_0.1.4 yaml_2.3.7 fastmap_1.1.1 R6_2.5.1
[17] knitr_1.42 tibble_3.2.1 rprojroot_2.0.3 tzdb_0.3.0
[21] bslib_0.4.2 pillar_1.9.0 rlang_1.1.0 utf8_1.2.3
[25] cachem_1.0.7 stringi_1.7.12 httpuv_1.6.9 xfun_0.39
[29] getPass_0.2-2 fs_1.6.2 sass_0.4.5 bit64_4.0.5
[33] cli_3.6.1 magrittr_2.0.3 ps_1.7.5 digest_0.6.31
[37] processx_3.8.1 vroom_1.6.1 rstudioapi_0.14 hms_1.1.3
[41] lifecycle_1.0.3 vctrs_0.6.2 evaluate_0.20 glue_1.6.2
[45] whisker_0.4.1 fansi_1.0.4 rmarkdown_2.21 httr_1.4.5
[49] tools_4.3.0 pkgconfig_2.0.3 htmltools_0.5.5