Last updated: 2023-06-02
Checks: 7 passed, 0 failed
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproducibility it’s best to always run the code in an empty environment.
The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 140c2a0. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: r_packages_4.1.2/
Ignored: r_packages_4.2.0/
Ignored: r_packages_4.2.2/
Ignored: r_packages_4.3.0/
Untracked files:
Untracked: analysis/cell_ranger.Rmd
Untracked: analysis/tss_xgboost.Rmd
Untracked: code/multiz100way/
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
Untracked: data/ncrna_NONCODE[v3.0].fasta.tar.gz
Untracked: data/ncrna_noncode_v3.fa
Untracked: data/netmhciipan.out.gz
Untracked: women.json
Unstaged changes:
Modified: analysis/graph.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were made to the R Markdown (analysis/r_parsing.Rmd) and HTML (docs/r_parsing.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 140c2a0 | Dave Tang | 2023-06-02 | Update post |
html | 27ea248 | Dave Tang | 2023-05-30 | Build site. |
Rmd | 7432d2f | Dave Tang | 2023-05-30 | Parsing irregular formats |
Some bioinformatics tools output files that are visually nice and are meant for manual inspection. This practice of generating visually nice output (and/or Excel files) may be rooted in how bioinformaticians and biologists used to work with each other: give the bioinformaticians some data to analyse and they will generate output that is manually inspected and verified by the biologists. I say “used to” because nowadays a lot of biologists are much more savvy with data processing and analysis, and can either analyse their own data or no longer need to do everything manually.
One such tool is NetMHCIIpan, which generates output that requires a bit more work to parse than a CSV or TSV file. Here’s how the output looks:
gunzip -c data/netmhciipan.out.gz | head -20
# NetMHCIIpan version 4.0
# Input is in FASTA format
# Peptide length 15
# Prediction Mode: EL
# Threshold for Strong binding peptides (%Rank) 2%
# Threshold for Weak binding peptides (%Rank) 10%
# Allele: DRB1_0101
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 DRB1_0101 MAEMKTDAATLAQEA 3 MKTDAATLA 0.980 P9WNK5 0.097960 9.99 NA <=WB
2 DRB1_0101 AEMKTDAATLAQEAG 2 MKTDAATLA 0.867 P9WNK5 0.041685 16.05 NA
3 DRB1_0101 EMKTDAATLAQEAGN 4 DAATLAQEA 0.613 P9WNK5 0.012857 28.51 NA
4 DRB1_0101 MKTDAATLAQEAGNF 3 DAATLAQEA 0.813 P9WNK5 0.009156 33.36 NA
5 DRB1_0101 KTDAATLAQEAGNFE 2 DAATLAQEA 0.627 P9WNK5 0.004353 45.83 NA
The base R function read.table() can skip a given number of lines at the start of a file, which would have solved the problem (after figuring out how many lines to skip), but the end of the file also contains some free text and lines starting with hyphens (-).
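For illustration, a skip-based call might look something like the sketch below (not run here); n_header is a hypothetical count of header lines that would have to be determined by inspecting the file.

# minimal sketch of the skip-based approach; n_header is a placeholder value
n_header <- 12
read.table(gzfile("data/netmhciipan.out.gz"), skip = n_header, header = FALSE)
# the dashed separator and "Number of strong binders ..." lines at the end of
# the file would still be read (or trigger errors), so skipping alone is not enough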
gunzip -c data/netmhciipan.out.gz | tail
80 DRB1_0101 GVQYSRADEEQQQAL 3 YSRADEEQQ 0.940 P9WNK5 0.033621 17.96 NA
81 DRB1_0101 VQYSRADEEQQQALS 2 YSRADEEQQ 0.900 P9WNK5 0.005510 41.52 NA
82 DRB1_0101 QYSRADEEQQQALSS 1 YSRADEEQQ 0.473 P9WNK5 0.001152 73.90 NA
83 DRB1_0101 YSRADEEQQQALSSQ 5 EEQQQALSS 0.733 P9WNK5 0.003404 50.77 NA
84 DRB1_0101 SRADEEQQQALSSQM 4 EEQQQALSS 0.840 P9WNK5 0.004729 44.32 NA
85 DRB1_0101 RADEEQQQALSSQMG 3 EEQQQALSS 0.667 P9WNK5 0.008057 35.30 NA
86 DRB1_0101 ADEEQQQALSSQMGF 5 QQALSSQMG 0.593 P9WNK5 0.006416 38.95 NA
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 3 Number of weak binders: 5
--------------------------------------------------------------------------------------------------------------------------------------------
The approach I opted for when parsing this output was to create a function that skips lines matching a specified pattern, i.e. a regular expression or regex. We will use the readr package, a very useful package for loading data into R.
library(readr)
The read_lines() function will be used first to read the input. This function reads up to n_max lines from a file and stores the lines in a character vector. Notably:

- Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed.
- Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded.

Specifying n_max = -1L will read all lines of a file; read_lines() is typically used to scan only a small portion of a file to figure out its content, but here we will use it to read every line.
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
str()
chr [1:104] "# NetMHCIIpan version 4.0" "" "# Input is in FASTA format" "" ...
We will define a skip_line function that can be used to skip lines that match a regex. This is typical of how I use Perl or Python to read and store files.

The regex below will match (and therefore skip) lines that:

- start with a hyphen (^-), or
- start with Number (^Number), or
- start with # (^#), or
- are empty (^$), or
- start with whitespace followed by Pos (^\\sPos).

These alternatives are combined with the regex alternation symbol (|).

skip_line <- function(x, regex = "^#"){
wanted <- !grepl(pattern = regex, x = x, perl = TRUE)
return(x[wanted])
}
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
head()
[1] " 1 DRB1_0101 MAEMKTDAATLAQEA 3 MKTDAATLA 0.980 P9WNK5 0.097960 9.99 NA <=WB"
[2] " 2 DRB1_0101 AEMKTDAATLAQEAG 2 MKTDAATLA 0.867 P9WNK5 0.041685 16.05 NA "
[3] " 3 DRB1_0101 EMKTDAATLAQEAGN 4 DAATLAQEA 0.613 P9WNK5 0.012857 28.51 NA "
[4] " 4 DRB1_0101 MKTDAATLAQEAGNF 3 DAATLAQEA 0.813 P9WNK5 0.009156 33.36 NA "
[5] " 5 DRB1_0101 KTDAATLAQEAGNFE 2 DAATLAQEA 0.627 P9WNK5 0.004353 45.83 NA "
[6] " 6 DRB1_0101 TDAATLAQEAGNFER 5 LAQEAGNFE 0.640 P9WNK5 0.007844 35.70 NA "
We use the native R pipe (|>) to pipe the character vector to the skip_line function, which then skips entries matching our regex; this requires R 4.1.0 or above. Note the use of the _ placeholder for the input vector (this requires R 4.2.0 or above); this is not strictly necessary, since the vector is automatically forwarded to the first argument of skip_line when using |>, but I have included it here to show how you can refer to the piped input.
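As a toy illustration (not part of the analysis above), the two forms below are equivalent:

x <- c("# a comment", "", "keep me")
# explicit placeholder (R 4.2.0 or above)
x |> skip_line(x = _, regex = "^#|^$")
# implicit forwarding to the first argument (R 4.1.0 or above)
x |> skip_line(regex = "^#|^$")
# both return "keep me"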
Finally, we will use the read_table function (which is similar to the base read.table function) to output a tibble from the lines that we want, i.e. those that were not skipped. This function is very useful for this type of data, where each column is separated by an irregular number of spaces (one or more).

We will define the column names and the column types, which are stored as a vector and a list, respectively. This is good practice because it ensures that each column contains a single data type. In addition, by explicitly defining the columns, read_table can assign NAs when data is not available.
my_colnames <- c(
"pos",
"mhc",
"peptide",
"offset",
"core",
"core_rel",
"id",
"score_el",
"perc_rank_el",
"exp_bind",
"bind_level"
)
my_coltypes <- list(
pos = "i",
mhc = "c",
peptide = "c",
offset = "i",
core = "c",
core_rel = "d",
id = "c",
score_el = "d",
perc_rank_el = "d",
exp_bind = "d",
bind_level = "c"
)
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
read_table(col_names = my_colnames, col_types = my_coltypes) |>
tail()
# A tibble: 6 × 11
pos mhc peptide offset core core_rel id score_el perc_rank_el exp_bind
<int> <chr> <chr> <int> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 81 DRB1… VQYSRA… 2 YSRA… 0.9 P9WN… 0.00551 41.5 NA
2 82 DRB1… QYSRAD… 1 YSRA… 0.473 P9WN… 0.00115 73.9 NA
3 83 DRB1… YSRADE… 5 EEQQ… 0.733 P9WN… 0.00340 50.8 NA
4 84 DRB1… SRADEE… 4 EEQQ… 0.84 P9WN… 0.00473 44.3 NA
5 85 DRB1… RADEEQ… 3 EEQQ… 0.667 P9WN… 0.00806 35.3 NA
6 86 DRB1… ADEEQQ… 5 QQAL… 0.593 P9WN… 0.00642 39.0 NA
# ℹ 1 more variable: bind_level <chr>
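As a quick sanity check (a small illustration using the objects defined above), the parsed bind_level column can be tabulated and compared with the summary at the end of the raw output:

parsed <- read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
  skip_line(regex = "^-|^Number|^#|^$|^\\sPos") |>
  read_table(col_names = my_colnames, col_types = my_coltypes)
table(parsed$bind_level, useNA = "ifany")
# the counts should match "Number of strong binders: 3  Number of weak binders: 5"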
My initial approach was to use read_lines to figure out how many lines to skip when using read_table, but the text at the end of the output resulted in parsing errors. That was when I decided that a better approach would be to explicitly state which lines to skip, no matter where they occur in the output.
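For reference, that initial approach looked something like the following sketch (hypothetical code, shown only to illustrate why it was abandoned):

all_lines <- read_lines("data/netmhciipan.out.gz", n_max = -1L)
# find the first data line: leading whitespace followed by a digit
first_data <- min(grep("^\\s+\\d", all_lines))
read_table(
  "data/netmhciipan.out.gz",
  skip = first_data - 1,
  col_names = my_colnames,
  col_types = my_coltypes
)
# the dashed separator and free-text summary lines at the end of the file
# still end up in the input, which is what caused the parsing errors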
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_2.1.4 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] bit_4.0.5 jsonlite_1.8.4 crayon_1.5.2 compiler_4.3.0
[5] promises_1.2.0.1 tidyselect_1.2.0 Rcpp_1.0.10 stringr_1.5.0
[9] git2r_0.32.0 parallel_4.3.0 callr_3.7.3 later_1.3.0
[13] jquerylib_0.1.4 yaml_2.3.7 fastmap_1.1.1 R6_2.5.1
[17] knitr_1.42 tibble_3.2.1 rprojroot_2.0.3 tzdb_0.3.0
[21] bslib_0.4.2 pillar_1.9.0 rlang_1.1.0 utf8_1.2.3
[25] cachem_1.0.7 stringi_1.7.12 httpuv_1.6.9 xfun_0.39
[29] getPass_0.2-2 fs_1.6.2 sass_0.4.5 bit64_4.0.5
[33] cli_3.6.1 magrittr_2.0.3 ps_1.7.5 digest_0.6.31
[37] processx_3.8.1 vroom_1.6.1 rstudioapi_0.14 hms_1.1.3
[41] lifecycle_1.0.3 vctrs_0.6.2 evaluate_0.20 glue_1.6.2
[45] whisker_0.4.1 fansi_1.0.4 rmarkdown_2.21 httr_1.4.5
[49] tools_4.3.0 pkgconfig_2.0.3 htmltools_0.5.5