Last updated: 2022-04-28

Checks: 7 0

Knit directory: muse/

This reproducible R Markdown analysis was created with workflowr (version 1.7.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.


Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

The command set.seed(20200712) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility.

The results in this page were generated with repository version 9b739a1. See the Past versions tab to see a history of the changes made to the R Markdown and HTML files.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    r_packages_4.1.2/

Untracked files:
    Untracked:  analysis/cell_ranger.Rmd
    Untracked:  data/ncrna_NONCODE[v3.0].fasta.tar.gz
    Untracked:  data/ncrna_noncode_v3.fa

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.


These are the previous versions of the repository in which changes were made to the R Markdown (analysis/seqinr.Rmd) and HTML (docs/seqinr.html) files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view the files as they were in that past version.

File Version Author Date Message
Rmd 9b739a1 Dave Tang 2022-04-28 Using the seqinr package

The seqinr package provides many useful functions for working with biological sequences in R. We will use data from NONCODE to demonstrate some features of seqinr.

wget 'http://www.noncode.org/datadownload/ncrna_NONCODE[v3.0].fasta.tar.gz'
tar xzf ncrna_NONCODE\[v3.0\].fasta.tar.gz
mv ncrna_NONCODE\[v3.0\].fasta ncrna_noncode_v3.fa
cat ncrna_noncode_v3.fa | grep "^>" | wc -l
411553

Show last couple of entries.

cat ../data/ncrna_noncode_v3.fa | grep "^>" | tail -3
cat: ../data/ncrna_noncode_v3.fa: No such file or directory

Install (if missing) and load seqinr.

if(!require("seqinr", quietly = TRUE)){
  install.packages("seqinr")
}
library("seqinr")

The read.fasta function is used to load a FASTA file and we will use it to load ncrna_noncode_v3.fa.

ncrna <- read.fasta("data/ncrna_noncode_v3.fa")
length(ncrna)
[1] 411553

The entries are saved in a list.

class(ncrna)
[1] "list"

Each list item is named after the first annotation in the FASTA file.

head(names(ncrna))
[1] "nncid" "n1"    "n2"    "n3"    "n4"    "n5"   

Check the first entry, which is stored in index 2, as the first entry is a fake FASTA entry that contains some information on the annotations stored in the FASTA file.

ncrna[[2]]
  [1] "a" "c" "c" "t" "c" "g" "a" "c" "c" "a" "c" "c" "c" "t" "t" "a" "a" "c"
 [19] "t" "t" "g" "g" "g" "t" "g" "c" "a" "g" "g" "t" "a" "t" "t" "c" "a" "a"
 [37] "c" "a" "a" "a" "a" "g" "c" "a" "a" "t" "g" "a" "a" "t" "c" "a" "a" "g"
 [55] "g" "a" "a" "t" "g" "a" "a" "t" "c" "a" "a" "t" "g" "g" "a" "t" "t" "t"
 [73] "t" "c" "a" "a" "t" "g" "g" "a" "t" "t" "t" "a" "t" "g" "g" "a" "t" "t"
 [91] "t" "t" "a" "a" "a" "a" "a" "c" "a" "g" "a" "g" "a" "a" "c" "t" "c" "a"
[109] "g" "a" "a" "a" "t" "c" "t" "a" "a" "c" "a" "g" "a" "a" "a" "t" "t" "t"
[127] "a" "a" "c" "a" "g" "a" "a" "a" "t" "t" "t" "a" "a" "a" "t" "t" "t" "g"
[145] "t" "c" "g" "a" "t" "c" "t" "a" "c" "a" "a" "a" "t" "t" "g" "c" "c" "c"
[163] "t" "t" "a" "t" "c" "t" "t" "t" "t" "t" "c" "c" "a" "t" "c" "t" "t" "a"
[181] "a" "a" "c" "t" "a" "a" "a" "c" "g" "t" "t" "a" "a" "t" "a" "a" "c" "t"
[199] "t" "a" "t" "t" "g" "t" "t" "g" "t" "t" "g" "a" "a" "t" "a" "c" "a" "g"
[217] "c" "t" "t" "g" "t" "g" "g" "a" "a" "t" "g" "t" "c" "g" "g" "g" "g" "t"
[235] "a" "c" "a" "a" "t" "g" "t" "c" "g" "g" "g" "g"
attr(,"name")
[1] "n1"
attr(,"Annot")
[1] ">n1 | AB002583 | tmRNA | chloroplast Cyanidioschyzon merolae | ssrA | NONCODE v2.0 | NULL | NULL | -1.0577600 | -0.3460597"
attr(,"class")
[1] "SeqFastadna"

Count nucleotides.

count(ncrna[[2]], 1)

 a  c  g  t 
86 40 44 76 

Count di-nucleotides.

count(ncrna[[2]], 2)

aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 
38 15  9 24 16  7  5 12 15  4 14 10 16 14 16 30 

GC percent.

GC(ncrna[[2]])
[1] 0.3414634

Store entire FASTA headers.

my_header <- getAnnot(ncrna)

Create data frame using FASTA header.

head(my_header[-1])
[[1]]
[1] ">n1 | AB002583 | tmRNA | chloroplast Cyanidioschyzon merolae | ssrA | NONCODE v2.0 | NULL | NULL | -1.0577600 | -0.3460597"

[[2]]
[1] ">n2 | AB002583 | RNase P RNA | chloroplast Cyanidioschyzon merolae | rnpB | NONCODE v2.0 | NULL | NULL | -1.1143400 | -0.4482952"

[[3]]
[1] ">n3 | AB003477 | tmRNA | Synechococcus sp | 10Sa | NONCODE v2.0 | NULL | NULL | -1.3614100 | -0.3138589"

[[4]]
[1] ">n4 | AB007644 | snoRNA | Arabidopsis thaliana (thale cress) | U3 | NONCODE v2.0 | NULL | NULL | -0.5276610 | -0.3278265"

[[5]]
[1] ">n5 | AB009049 | snoRNA | Arabidopsis thaliana (thale cress) | U24 | NONCODE v2.0 | NULL | NULL | -1.0309300 | -0.5177013"

[[6]]
[1] ">n6 | AB009051 | snRNA | Arabidopsis thaliana (thale cress) | U6 | NONCODE v2.0 | NULL | NULL | -1.3622000 | -0.1069238"
my_split <- lapply(my_header[-1], function(x){
  str_split(x, " \\| ", simplify = TRUE)
})

my_df <- do.call(rbind.data.frame, my_split)

names(my_df) <- as.vector(str_split(my_header[[1]], " \\| ", simplify = TRUE))

my_df %>%
  rename(nncid = ">nncid") %>%
  mutate(nncid = sub("^>", "", nncid)) -> my_df

head(my_df)
  nncid     accn       class                            organism name
1    n1 AB002583       tmRNA chloroplast Cyanidioschyzon merolae ssrA
2    n2 AB002583 RNase P RNA chloroplast Cyanidioschyzon merolae rnpB
3    n3 AB003477       tmRNA                    Synechococcus sp 10Sa
4    n4 AB007644      snoRNA  Arabidopsis thaliana (thale cress)   U3
5    n5 AB009049      snoRNA  Arabidopsis thaliana (thale cress)  U24
6    n6 AB009051       snRNA  Arabidopsis thaliana (thale cress)   U6
           ref transcriptID  url   cpcScore       cnci
1 NONCODE v2.0         NULL NULL -1.0577600 -0.3460597
2 NONCODE v2.0         NULL NULL -1.1143400 -0.4482952
3 NONCODE v2.0         NULL NULL -1.3614100 -0.3138589
4 NONCODE v2.0         NULL NULL -0.5276610 -0.3278265
5 NONCODE v2.0         NULL NULL -1.0309300 -0.5177013
6 NONCODE v2.0         NULL NULL -1.3622000 -0.1069238

Organisms with the most ncRNAs.

my_df %>%
  group_by(organism) %>%
  summarise(tally = n()) %>%
  arrange(desc(tally)) %>%
  head()
# A tibble: 6 × 2
  organism                tally
  <chr>                   <int>
1 Mus musculus           119597
2 Drosophila melanogas   102171
3 Homo sapiens            91067
4 Norway rat              66760
5 NULL                    15781
6 Caenorhabditis elegans   4718

Class with the most ncRNAs.

my_df %>%
  group_by(class) %>%
  summarise(tally = n()) %>%
  arrange(desc(tally)) %>%
  head()
# A tibble: 6 × 2
  class              tally
  <chr>              <int>
1 piRNA             174724
2 mature_transcript 102046
3 lncRNA             50615
4 mRNAlike lncRNA    43530
5 miRNA              20550
6 other               4192

Find all human piwi-interacting RNAs (piRNAs) and store their nncid.

my_df %>%
  filter(class == "piRNA", organism =="Homo sapiens") %>%
  pull(nncid) -> nncid_human_pirna

length(nncid_human_pirna)
[1] 32152

Create pirna_human for storing human piRNAs.

pirna_human <- ncrna[nncid_human_pirna]
getSequence(pirna_human[[1]], as.string = TRUE)
[[1]]
[1] "tagtgatgtgttcgttggtaagaggga"

Report number of sequences with N’s and remove them.

my_n <- grepl('n', unlist(getSequence(pirna_human, as.string = TRUE)))
pirna_human <- pirna_human[!my_n]
table(my_n)
my_n
FALSE  TRUE 
31580   572 

Save pirna_human as a FASTA file (not run).

my_anno <- getAnnot(pirna_human)
my_anno <- lapply(my_anno, function(x) sub("^>", "", x))
write.fasta(sequences = pirna_human, names = my_anno, file.out = "human_pirna.fa")

piRNAs typically start with U/T.

prop.table(table(sapply(pirna_human, function(x) x[1])))

         a          c          g          t 
0.05984801 0.06614946 0.08429386 0.78970868 

In addition, piRNAs typically have an A at the tenth base and the proportion below is slightly higher than 0.25 if we expect an equal distribution.

prop.table(table(sapply(pirna_human, function(x) x[10])))

        a         c         g         t 
0.3257758 0.1906586 0.2499683 0.2335972 

Length distribution of piRNAs.

table(getLength(pirna_human))

  16   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
   2    1    3    6    1    6    1    4    7 4186 3792 3901 6580 7662 4238 1188 
  33 
   2 

piRNAs are typically 26-31 nucleotide long, as observed below.

as.data.frame(table(getLength(pirna_human))) %>%
  ggplot(., aes(Var1, Freq)) +
    geom_col() +
    labs(x = "piRNA length", y = "Frequency") +
    theme_bw()

10th base proportion for different lengths.

my_len <- 26:32
my_prop <- sapply(my_len, function(x){
  wanted <- getLength(pirna_human) == x
  prop.table(table(sapply(pirna_human[wanted], function(x) x[10])))
})

as.data.frame(t(my_prop), row.names = my_len)
           a         c         g         t
26 0.2730530 0.1889632 0.2859532 0.2520306
27 0.3364979 0.1993671 0.2217827 0.2423523
28 0.3429890 0.1920021 0.2260959 0.2389131
29 0.3354103 0.1925532 0.2428571 0.2291793
30 0.3308536 0.1866353 0.2582877 0.2242234
31 0.3265691 0.1885323 0.2550731 0.2298254
32 0.3324916 0.1860269 0.2584175 0.2230640

Frequencies of nucleotides along every position.

my_lens <- 26:32
for (my_len in my_lens){
  wanted <- getLength(pirna_human) == my_len
  my_seq <- getSequence(pirna_human[wanted])
  my_freq <- apply(do.call(rbind, my_seq), 2, function(x) prop.table(table(x)))
  
  as.data.frame(my_freq) %>%
    mutate(nuc = c('A', 'C', 'G', 'T')) %>%
    select(nuc, everything()) %>%
    pivot_longer(!nuc, names_to = "Position", values_to = "Frequency") %>%
    mutate(Position = as.integer(sub("^V", "", Position))) -> my_freq_df
  
  ggplot(my_freq_df, aes(Position, Frequency, colour = nuc)) +
    geom_line() +
    geom_point() +
    theme_bw() +
    ggtitle(paste0("Nucleotide frequencies along ", my_len, " nt human piRNAs")) -> p
  
  print(p)
}

Please refer to the seqinr manual for further information.


sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] seqinr_4.2-8    forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
 [5] purrr_0.3.4     readr_2.1.1     tidyr_1.1.4     tibble_3.1.6   
 [9] ggplot2_3.3.5   tidyverse_1.3.1 workflowr_1.7.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.8       lubridate_1.8.0  getPass_0.2-2    ps_1.6.0        
 [5] assertthat_0.2.1 rprojroot_2.0.2  digest_0.6.29    utf8_1.2.2      
 [9] R6_2.5.1         cellranger_1.1.0 backports_1.4.1  reprex_2.0.1    
[13] evaluate_0.14    highr_0.9        httr_1.4.2       pillar_1.6.5    
[17] rlang_1.0.0      readxl_1.3.1     rstudioapi_0.13  whisker_0.4     
[21] callr_3.7.0      jquerylib_0.1.4  rmarkdown_2.11   labeling_0.4.2  
[25] munsell_0.5.0    broom_0.7.11     compiler_4.1.2   httpuv_1.6.5    
[29] modelr_0.1.8     xfun_0.29        pkgconfig_2.0.3  htmltools_0.5.2 
[33] tidyselect_1.1.1 fansi_1.0.2      crayon_1.4.2     tzdb_0.2.0      
[37] dbplyr_2.1.1     withr_2.4.3      later_1.3.0      MASS_7.3-54     
[41] grid_4.1.2       jsonlite_1.7.3   gtable_0.3.0     lifecycle_1.0.1 
[45] DBI_1.1.2        git2r_0.29.0     magrittr_2.0.2   scales_1.1.1    
[49] cli_3.1.1        stringi_1.7.6    farver_2.1.0     fs_1.5.2        
[53] promises_1.2.0.1 xml2_1.3.3       ellipsis_0.3.2   generics_0.1.1  
[57] vctrs_0.3.8      tools_4.1.2      ade4_1.7-19      glue_1.6.1      
[61] hms_1.1.1        processx_3.5.2   fastmap_1.1.0    yaml_2.2.2      
[65] colorspace_2.0-2 rvest_1.0.2      knitr_1.37       haven_2.4.3