Last updated: 2019-03-07
workflowr checks: (Click a bullet for more information) ✔ R Markdown file: up-to-date
Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.
✔ Environment: empty
Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.
✔ Seed:
set.seed(20190307)
The command set.seed(20190307)
was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.
✔ Session information: recorded
Great job! Recording the operating system, R version, and package versions is critical for reproducibility.
✔ Repository version: 934cda8
wflow_publish
or wflow_git_commit
). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
Ignored files:
Ignored: .DS_Store
Ignored: .Rhistory
Ignored: .Rproj.user/
Ignored: analysis/.DS_Store
Ignored: data/.DS_Store
Untracked files:
Untracked: analysis/README.md
Untracked: analysis/queen_pheromone.Rmd
Untracked: bioinformatics scripts/
Untracked: code/Script to make the donut plot.R
Untracked: code/Script to run multivariate brms model.R
Untracked: code/Script to set up for GO analyses.R
Untracked: code/pdf_supplementary_material.Rmd
Untracked: code/pdf_supplementary_material.pdf
Untracked: code/wojciechowski_histone_analysis.R
Untracked: data/apis.org.db
Untracked: data/apis_gene_comparisons/
Untracked: data/brms_model.txt
Untracked: data/brms_model_comparisons.rds
Untracked: data/brms_model_summary.rds
Untracked: data/component spreadsheets of queen_pheromone.db/
Untracked: data/gene_set_collection.RData
Untracked: data/gene_set_collection_kegg.RData
Untracked: data/morandin_comparison_data/
Untracked: data/most.pheromone.sensitive.genes.rds
Untracked: data/queen_pheromone.db
Untracked: docs/figure/
Untracked: figures/
Untracked: manuscript/
Untracked: supplement/
Unstaged changes:
Modified: .gitignore
Deleted: data/README.md
Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
File | Version | Author | Date | Message |
---|---|---|---|---|
Rmd | 934cda8 | Luke Holman | 2019-03-07 | First commit of new website structure |
There are already sequenced genomes of A. mellifera and B. terrestris and L. niger in NCBI. We’ll have to make something for L. flavus from scratch.
We also need to determine possible isoforms for the genes in the data set, which we will do using tophat for the genomes that have references.
This specis has no isoform data in NCBI, so we create our own. First we create bowtie2 indexed for the reference genomes using bowtie2-build
, and then tophat and cufflinks are executed using tophat.sh
using reference-based mapping to identify alternative splicing.
trinity.sh
assemble lf from raw reads. We are not using a genome guided assembly, since the ln genome is fragmented, as per recommendation of the trinity authors transdecoder.sh
predict proteins from the transcripts.
kallisto We use kallisto for gene expression analysis of transcripts from the NCBI data bases for A. mellifera and B. terrestis, for the TopHat assembly of L. niger and for the Trinity assembly of L. flavus.
rsem and ebseq This is the more traditional approach using the same data sources. We prepare references as ngvector files from the predicted transcripts, as per instructions.
We handle this by reciprocal blastp hit on the proteins vs the honey bee genome (the best annotated of the bunch) e.g., blastp -num_threads 12 -query lf.fa -db amel -outfmt 6 -evalue 1e-4 -max_target_seqs 1
We here present the individual scripts in code chunks, for convenient viewing.
ebseq.sh
#!/bin/bash
#SBATCH --job-name=ebseq
#SBATCH --partition=compute
#SBATCH --mem=500M
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
rsem=/apps/unit/MikheyevU/sasha/RSEM
refdir=./data/rsem
a=($(ls -1 ./data/raw_reads/trimmed/*_1.fastq.gz |awk -F/ '{print $NF}' |cut -c-2 |uniq)) #4
species=${a["SLURM_ARRAY_TASK_ID"]}
echo $species
# contrasts are relative to control treatments, i.e., control goes first
if [ "$species" == "bt" ]; then
samples=`for i in 2 4 5 8 10 1 3 6 7 9; do echo -ne $refdir/$species$i".isoforms.results ";done`
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="5,5"
elif [ "$species" == "am" ]; then
samples=`for i in 2 4 6 1 3; do echo -ne $refdir/$species$i".isoforms.results ";done`
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="3,2"
elif [ "$species" == "lf" ]; then
samples=`for i in 4 6 8 10 12 14 16 1 3 5 7 11 13; do echo -ne $refdir/$species$i".isoforms.results " ;done`
echo $samples
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="7,6"
elif [ "$species" == "ln" ]; then
samples=`for i in 1 3 5 7 11 2 4 6 8 12; do echo -ne $refdir/$species$i".isoforms.results ";done`
$rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
conditions="5,5"
fi
$rsem/rsem-generate-data-matrix `echo $samples | sed 's/isoforms/genes/g'` > $refdir/$species.genes.matrix
$rsem/rsem-run-ebseq $refdir/$species.genes.matrix $conditions $refdir/$species".genes.ebseq"
$rsem/rsem-control-fdr $refdir/$species".genes.ebseq" .05 $refdir/$species".genes.padj.ebseq"
$rsem/rsem-run-ebseq --ngvector ./ref/$species.ngvec $refdir/$species.matrix $conditions $refdir/$species".ebseq"
$rsem/rsem-control-fdr $refdir/$species".ebseq" .05 $refdir/$species".padj.ebseq"
kallisto.sh
#!/bin/bash
#SBATCH --job-name=kalisto
#SBATCH --partition=compute
#SBATCH --mem=3G
#SBATCH --time=5:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
module load kallisto
##species=lf #am, bt, ln
##SLURM_ARRAY_TASK_ID=0
#for sbatch: --export=species=am
#10 for bt, 6 for am, 12 for ln, 15 lf
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz)
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`
if [ "$species" == "bt" ]; then
ref=./ref/GCF_000214255.1_Bter_1.0_rna
elif [ "$species" == "am" ]; then
ref=./ref/GCF_000002195.4_Amel_4.5_rna
elif [ "$species" == "lf" ]; then
ref=./ref/lf
elif [ "$species" == "ln" ]; then
ref=./ref/ln
fi
kallisto quant -i $ref -o data/kallisto/$base -b 100 $f $r
orthodb.py
# takes a list of protein sequences and determines their orthogroup ids
#with python/2.7.8
#takes two files, source of fasta and output
import urllib, json, time, sys
from Bio import SeqIO
urlbase = "http://www.orthodb.org/blast?"
outfile = open(sys.argv[2], "w", 0)
level = "level=7434" #Aculeata
count = 1
start = time.time()
for rec in SeqIO.parse(sys.argv[1], "fasta"):
while True:
try:
response = urllib.urlopen(urlbase + level + "&seq=" + str(rec.seq).replace("*",""))
data = response.read()
parsedData = json.loads(data)
except:
print "http error"
time.sleep(2)
else:
break
if parsedData['data']:
outfile.write(rec.id + "\t" + ",".join(parsedData['data']) + "\n")
count += 1
if count % 100 == 0:
print("%i %.2f" % (count, time.time() - start))
start = time.time()
rsem-generate-data-matrix
#!/usr/bin/perl
use strict;
if (scalar(@ARGV) == 0) {
print "Usage: rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.matrix\n";
print "Results files should be either all .genes.results or all .isoforms.results.\n";
exit(-1);
}
my $offsite = 5; # ###### change to TPM
my $line;
my $n = scalar(@ARGV);
my $M = -1;
my @matrix = ();
# 0, file_name; 1, reference of expected count array; 2, reference of transcript_id/gene_id array
sub loadData {
open(INPUT, $_[0]);
my $line = <INPUT>; # The first line contains only column names
while ($line = <INPUT>) {
chomp($line);
my @fields = split(/\t/, $line);
push(@{$_[2]}, "\"$fields[0]\"");
push(@{$_[1]}, $fields[$offsite]);
}
close(INPUT);
if (scalar(@{$_[1]}) == 0) {
print STDERR "Nothing is detected! $_[0] may not exist or is empty.\n";
exit(-1);
}
}
#0, M; 1, reference of @ids_arr; 2, reference of @ids
sub check {
my $size = $_[0];
for (my $i = 0; $i < $size; $i++) {
if ($_[1]->[$i] ne $_[2]->[$i]) {
return 0;
}
}
return 1;
}
my @ids_arr = ();
for (my $i = 0; $i < $n; $i++) {
my (@ids, @ecs) = ();
&loadData($ARGV[$i], \@ecs, \@ids);
if ($M < 0) {
$M = scalar(@ids);
@ids_arr = @ids;
}
elsif (!&check($M, \@ids_arr, \@ids)) {
print STDERR "Number of lines among samples are not equal!\n";
exit(-1);
}
my $colname;
if (substr($ARGV[$i], 0, 2) eq "./") { $colname = substr($ARGV[$i], 2); }
else { $colname = $ARGV[$i]; }
$colname = "\"$colname\"";
@ecs = ($colname, @ecs);
push(@matrix, \@ecs);
}
@ids_arr = ("", @ids_arr);
@matrix = (\@ids_arr, @matrix);
for (my $i = 0; $i <= $M; $i++) {
for (my $j = 0; $j < $n; $j++) { print "$matrix[$j][$i]\t"; }
print "$matrix[$n][$i]\n";
}
rsem.sh
#!/bin/bash
#SBATCH --job-name=rsem
#SBATCH --partition=compute
#SBATCH --mem=4G
#SBATCH --time=2-00:00:00
#SBATCH --cpus-per-task=10
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
module load rsem bowtie2
cd data/rsem
refdir=../raw_reads/trimmed
a=($refdir/*_1.fastq.gz) #43
b=($refdir/*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
species=`basename $f | cut -c-2`
base=`basename $f _1.fastq.gz`
if [ "$species" == "bt" ]; then
ref=../../ref/bt
elif [ "$species" == "am" ]; then
ref=../../ref/am
elif [ "$species" == "lf" ]; then
ref=../../ref/lf
elif [ "$species" == "ln" ]; then
ref=../../ref/ln
fi
echo rsem-calculate-expression -p 8 --paired-end --bowtie2 $f $r $ref $base
/apps/unit/MikheyevU/sasha/RSEM/rsem-calculate-expression -p 10 --bowtie2 --paired-end $f $r $ref $base
tophat.sh
#!/bin/bash
#SBATCH --job-name=tophat
#SBATCH --partition=compute
##SBATCH --mem=1G
#SBATCH --time=1-00:00:00
##SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
#species=ln #am, bt, ln
#for sbatch: --export=species=am
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz) #10 for bt, 6 for am, 12 for ln
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`
module load bowtie2/2.2.6 tophat/2.1.1 cufflinks/2.2.1
if [ "$species" == "bt" ]; then
gff=./ref/GCF_000214255.1_Bter_1.0_genomic.gff
ref=./ref/bt
elif [ "$species" == "am" ]; then
gff=./ref/GCF_000002195.4_Amel_4.5_genomic.gff
ref=./ref/am
else
gff=./ref/GCA_001045655.1_ASM104565v1_genomic.gff
ref=./ref/ln
fi
echo $SLURM_ARRAY_TASK_ID $species $gff $ref
echo $species $ref $base $gff $f $r
#tophat2 -p 1 -G $gff -o ./data/assembly/tophat/$base $ref $f $r
cufflinks -p 1 -g $gff ./data/assembly/tophat/$base/accepted_hits.bam -o ./data/assembly/tophat/$base
transdecoder.sh
#!/bin/bash
#SBATCH --job-name=transdecoder
#SBATCH --partition=compute
#SBATCH --mem=10G
#SBATCH --cpus-per-task=10
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
##SLURM_ARRAY_TASK_ID=2
module load Trinity/2.1.1
infile=./data/assembly/trinity_lf/Trinity.fasta
base=lf
TransDecoder -t $infile --reuse --workdir ./output_"$base" --search_pfam /apps/unit/MikheyevU/sasha/TransDecoder_r20140704/pfam/Pfam-AB.hmm.bin --CPU 10 -v
trimmomatic.sh
#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --partition=compute
#SBATCH --mem=20G
#SBATCH --cpus-per-task=8
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
trimmomatic=/apps/unit/MikheyevU/sasha/trimmomatic
##SLURM_ARRAY_TASK_ID=1
a=(*/*_1.fastq.gz) # 43
b=(*/*_2.fastq.gz) # 43
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=$(basename $f _1.fastq.gz)
java -jar $trimmomatic/trimmomatic-0.32.jar PE -threads 8 -phred33 $f $r trimmed/$base"_1.fastq.gz" trimmed/$base"_unpaired1.fastq.gz" trimmed/$base"_2.fastq.gz" trimmed/$base"_unpaired2.fastq.gz" ILLUMINACLIP:$trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:25
trinity.sh
#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH --partition=compute
#SBATCH --mem=80G
#SBATCH --time=3-00:00:00
#SBATCH --cpus-per-task=12
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc
species=lf
refdir=../data/raw_reads/merged
module load Trinity/2.1.1 bowtie/1.1.0
Trinity --seqType fq --max_memory 75G --left $refdir/"$species"_1.fastq.gz \
--right $refdir/"$species"_2.fastq.gz \
--CPU 12 --output ../data/assembly/trinity_"$species"
sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] workflowr_1.1.1 Rcpp_1.0.0 lattice_0.20-35
[4] digest_0.6.18 rprojroot_1.3-2 R.methodsS3_1.7.1
[7] grid_3.5.1 jsonlite_1.6 backports_1.1.2
[10] git2r_0.23.0 magrittr_1.5 evaluate_0.11
[13] stringi_1.3.1 whisker_0.3-2 R.oo_1.22.0
[16] R.utils_2.7.0 Matrix_1.2-14 reticulate_1.10
[19] rmarkdown_1.10 tools_3.5.1 stringr_1.3.1
[22] yaml_2.2.0 compiler_3.5.1 htmltools_0.3.6
[25] knitr_1.20
This reproducible R Markdown analysis was created with workflowr 1.1.1