Last updated: 2023-05-30
Checks: 7 0
Knit directory: muse/
This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
The command set.seed(20200712)
was run prior to running
the code in the R Markdown file. Setting a seed ensures that any results
that rely on randomness, e.g. subsampling or permutations, are
reproducible.
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.
Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.
Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.
The results in this page were generated with repository version 7432d2f. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.
Note that you need to be careful to ensure that all relevant files for
the analysis have been committed to Git prior to generating the results
(you can use wflow_publish
or
wflow_git_commit
). workflowr only checks the R Markdown
file, but you know if there are other scripts or data files that it
depends on. Below is the status of the Git repository when the results
were generated:
Ignored files:
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: r_packages_4.1.2/
Ignored: r_packages_4.2.0/
Ignored: r_packages_4.2.2/
Ignored: r_packages_4.3.0/
Untracked files:
Untracked: analysis/cell_ranger.Rmd
Untracked: analysis/tss_xgboost.Rmd
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz
Untracked: data/HG00702_SH089_CHSTrio.chr1.vcf.gz.tbi
Untracked: data/ncrna_NONCODE[v3.0].fasta.tar.gz
Untracked: data/ncrna_noncode_v3.fa
Untracked: data/netmhciipan.out.gz
Untracked: women.json
Unstaged changes:
Modified: analysis/graph.Rmd
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
These are the previous versions of the repository in which changes were
made to the R Markdown (analysis/r_parsing.Rmd
) and HTML
(docs/r_parsing.html
) files. If you’ve configured a remote
Git repository (see ?wflow_git_remote
), click on the
hyperlinks in the table below to view the files as they were in that
past version.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 7432d2f | Dave Tang | 2023-05-30 | Parsing irregular formats |
Some bioinformatics tools output files that are visually nice and are meant for manual inspection. This practice of generating visually nice output (and/or) Excel files may be rooted in how bioinformaticians and biologists used to work with each other; give the bioinformatician/s some data and they will generate output that will be manually inspected and verified by the biologist/s. I say “used to” because nowadays a lot of biologists are much more savvy with data processing and analysis, and don’t need to do things manually.
One such tool is NetMHCIIpan, which generates output that requires a bit more work to parse in contrast to parsing a CSV or TSV file.
gunzip -c data/netmhciipan.out.gz | head -20
# NetMHCIIpan version 4.0
# Input is in FASTA format
# Peptide length 15
# Prediction Mode: EL
# Threshold for Strong binding peptides (%Rank) 2%
# Threshold for Weak binding peptides (%Rank) 10%
# Allele: DRB1_0101
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 DRB1_0101 MAEMKTDAATLAQEA 3 MKTDAATLA 0.980 P9WNK5 0.097960 9.99 NA <=WB
2 DRB1_0101 AEMKTDAATLAQEAG 2 MKTDAATLA 0.867 P9WNK5 0.041685 16.05 NA
3 DRB1_0101 EMKTDAATLAQEAGN 4 DAATLAQEA 0.613 P9WNK5 0.012857 28.51 NA
4 DRB1_0101 MKTDAATLAQEAGNF 3 DAATLAQEA 0.813 P9WNK5 0.009156 33.36 NA
5 DRB1_0101 KTDAATLAQEAGNFE 2 DAATLAQEA 0.627 P9WNK5 0.004353 45.83 NA
We will use the readr
package, a very useful package for
loading data into R.
library(readr)
The read_lines()
function reads up to n_max
lines from a file and stores
each line as a character vector.
.gz
, .bz2
,
.xz
or .zip
will be automatically
uncompressed.http://
, https://
,
ftp://
, or ftps://
will be automatically
downloaded.n_max = -1L
will read all lines.read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
str()
chr [1:104] "# NetMHCIIpan version 4.0" "" "# Input is in FASTA format" "" ...
We will define a skip_line
function that can be used to
skip lines that match a regular expression (regex). The regex below will
match (and skip) lines that:
-
(^-
) or
(|
)Number
(^Number
) or
(|
)#
(^#
) or
(|
)^$
) or
(|
)Pos
(^\\sPos
).skip_line <- function(x, regex = "^#"){
wanted <- !grepl(pattern = regex, x = x, perl = TRUE)
return(x[wanted])
}
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
head()
[1] " 1 DRB1_0101 MAEMKTDAATLAQEA 3 MKTDAATLA 0.980 P9WNK5 0.097960 9.99 NA <=WB"
[2] " 2 DRB1_0101 AEMKTDAATLAQEAG 2 MKTDAATLA 0.867 P9WNK5 0.041685 16.05 NA "
[3] " 3 DRB1_0101 EMKTDAATLAQEAGN 4 DAATLAQEA 0.613 P9WNK5 0.012857 28.51 NA "
[4] " 4 DRB1_0101 MKTDAATLAQEAGNF 3 DAATLAQEA 0.813 P9WNK5 0.009156 33.36 NA "
[5] " 5 DRB1_0101 KTDAATLAQEAGNFE 2 DAATLAQEA 0.627 P9WNK5 0.004353 45.83 NA "
[6] " 6 DRB1_0101 TDAATLAQEAGNFER 5 LAQEAGNFE 0.640 P9WNK5 0.007844 35.70 NA "
We use the native R pipe (|>
) to pipe the character
vector to the skip_line
function, which will then skip
entries matching our regex. Note the use of the _
placeholder for the input vector; this is not necessary since the vector
will be automatically forwarded to the skip_line
function
when using |>
.
Finally, we will use the read_table
function to output a tibble
from the lines that we want,
i.e. were not skipped. This function is very useful for this type of
data where each column is separated by an irregular number of spaces
(one or more).
We will define the column names and the column types that are stored as a vector and list, respectively. This is good practice because this ensures that your column contains the same data type.
my_colnames <- c(
"pos",
"mhc",
"peptide",
"offset",
"core",
"core_rel",
"id",
"score_el",
"perc_rank_el",
"exp_bind",
"bind_level"
)
my_coltypes <- list(
pos = "i",
mhc = "c",
peptide = "c",
offset = "i",
core = "c",
core_rel = "d",
id = "c",
score_el = "d",
perc_rank_el = "d",
exp_bind = "d",
bind_level = "c"
)
read_lines("data/netmhciipan.out.gz", n_max = -1L) |>
skip_line(x = _, regex = "^-|^Number|^#|^$|^\\sPos") |>
read_table(col_names = my_colnames, col_types = my_coltypes) |>
tail()
# A tibble: 6 × 11
pos mhc peptide offset core core_rel id score_el perc_rank_el exp_bind
<int> <chr> <chr> <int> <chr> <dbl> <chr> <dbl> <dbl> <dbl>
1 81 DRB1… VQYSRA… 2 YSRA… 0.9 P9WN… 0.00551 41.5 NA
2 82 DRB1… QYSRAD… 1 YSRA… 0.473 P9WN… 0.00115 73.9 NA
3 83 DRB1… YSRADE… 5 EEQQ… 0.733 P9WN… 0.00340 50.8 NA
4 84 DRB1… SRADEE… 4 EEQQ… 0.84 P9WN… 0.00473 44.3 NA
5 85 DRB1… RADEEQ… 3 EEQQ… 0.667 P9WN… 0.00806 35.3 NA
6 86 DRB1… ADEEQQ… 5 QQAL… 0.593 P9WN… 0.00642 39.0 NA
# ℹ 1 more variable: bind_level <chr>
sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] readr_2.1.4 workflowr_1.7.0
loaded via a namespace (and not attached):
[1] bit_4.0.5 jsonlite_1.8.4 crayon_1.5.2 compiler_4.3.0
[5] promises_1.2.0.1 tidyselect_1.2.0 Rcpp_1.0.10 stringr_1.5.0
[9] git2r_0.32.0 parallel_4.3.0 callr_3.7.3 later_1.3.0
[13] jquerylib_0.1.4 yaml_2.3.7 fastmap_1.1.1 R6_2.5.1
[17] knitr_1.42 tibble_3.2.1 rprojroot_2.0.3 tzdb_0.3.0
[21] bslib_0.4.2 pillar_1.9.0 rlang_1.1.0 utf8_1.2.3
[25] cachem_1.0.7 stringi_1.7.12 httpuv_1.6.9 xfun_0.39
[29] getPass_0.2-2 fs_1.6.2 sass_0.4.5 bit64_4.0.5
[33] cli_3.6.1 magrittr_2.0.3 ps_1.7.5 digest_0.6.31
[37] processx_3.8.1 vroom_1.6.1 rstudioapi_0.14 hms_1.1.3
[41] lifecycle_1.0.3 vctrs_0.6.2 evaluate_0.20 glue_1.6.2
[45] whisker_0.4.1 fansi_1.0.4 rmarkdown_2.21 httr_1.4.5
[49] tools_4.3.0 pkgconfig_2.0.3 htmltools_0.5.5