Last updated: 2019-03-07

workflowr checks: (Click a bullet for more information)
  • R Markdown file: up-to-date

    Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

  • Environment: empty

    Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

  • Seed: set.seed(20190307)

    The command set.seed(20190307) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

  • Session information: recorded

    Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

  • Repository version: 934cda8

    Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

    Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:
    
    Ignored files:
        Ignored:    .DS_Store
        Ignored:    .Rhistory
        Ignored:    .Rproj.user/
        Ignored:    analysis/.DS_Store
        Ignored:    data/.DS_Store
    
    Untracked files:
        Untracked:  analysis/README.md
        Untracked:  analysis/queen_pheromone.Rmd
        Untracked:  bioinformatics scripts/
        Untracked:  code/Script to make the donut plot.R
        Untracked:  code/Script to run multivariate brms model.R
        Untracked:  code/Script to set up for GO analyses.R
        Untracked:  code/pdf_supplementary_material.Rmd
        Untracked:  code/pdf_supplementary_material.pdf
        Untracked:  code/wojciechowski_histone_analysis.R
        Untracked:  data/apis.org.db
        Untracked:  data/apis_gene_comparisons/
        Untracked:  data/brms_model.txt
        Untracked:  data/brms_model_comparisons.rds
        Untracked:  data/brms_model_summary.rds
        Untracked:  data/component spreadsheets of queen_pheromone.db/
        Untracked:  data/gene_set_collection.RData
        Untracked:  data/gene_set_collection_kegg.RData
        Untracked:  data/morandin_comparison_data/
        Untracked:  data/most.pheromone.sensitive.genes.rds
        Untracked:  data/queen_pheromone.db
        Untracked:  docs/figure/
        Untracked:  figures/
        Untracked:  manuscript/
        Untracked:  supplement/
    
    Unstaged changes:
        Modified:   .gitignore
        Deleted:    data/README.md
    
    
    Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.
Expand here to see past versions:
    File Version Author Date Message
    Rmd 934cda8 Luke Holman 2019-03-07 First commit of new website structure

Workflow

There are already sequenced genomes of A. mellifera and B. terrestris and L. niger in NCBI. We’ll have to make something for L. flavus from scratch.

We also need to determine possible isoforms for the genes in the data set, which we will do using tophat for the genomes that have references.

L. niger workflow

This specis has no isoform data in NCBI, so we create our own. First we create bowtie2 indexed for the reference genomes using bowtie2-build, and then tophat and cufflinks are executed using tophat.sh using reference-based mapping to identify alternative splicing.

L flavus workflow

trinity.sh assemble lf from raw reads. We are not using a genome guided assembly, since the ln genome is fragmented, as per recommendation of the trinity authors transdecoder.sh predict proteins from the transcripts.

Gene expression analysis

kallisto We use kallisto for gene expression analysis of transcripts from the NCBI data bases for A. mellifera and B. terrestis, for the TopHat assembly of L. niger and for the Trinity assembly of L. flavus.

rsem and ebseq This is the more traditional approach using the same data sources. We prepare references as ngvector files from the predicted transcripts, as per instructions.

Orthology

We handle this by reciprocal blastp hit on the proteins vs the honey bee genome (the best annotated of the bunch) e.g., blastp -num_threads 12 -query lf.fa -db amel -outfmt 6 -evalue 1e-4 -max_target_seqs 1

Individual scripts

We here present the individual scripts in code chunks, for convenient viewing.

ebseq.sh

#!/bin/bash
#SBATCH --job-name=ebseq
#SBATCH --partition=compute
#SBATCH --mem=500M
#SBATCH --time=1:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc


rsem=/apps/unit/MikheyevU/sasha/RSEM
refdir=./data/rsem
a=($(ls -1  ./data/raw_reads/trimmed/*_1.fastq.gz |awk -F/ '{print $NF}' |cut -c-2 |uniq)) #4
species=${a["SLURM_ARRAY_TASK_ID"]}
echo $species

# contrasts are relative to control treatments, i.e., control goes first

if [ "$species" == "bt" ]; then
    samples=`for i in 2 4 5 8 10 1 3 6 7 9; do echo -ne $refdir/$species$i".isoforms.results ";done`
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="5,5"
elif [ "$species" == "am" ]; then    
    samples=`for i in 2 4 6 1 3; do echo -ne $refdir/$species$i".isoforms.results ";done`
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="3,2"
elif [ "$species" == "lf" ]; then
    samples=`for i in 4 6 8 10 12 14 16 1 3 5 7 11 13; do echo -ne $refdir/$species$i".isoforms.results " ;done`
    echo $samples
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="7,6"
elif [ "$species" == "ln" ]; then
    samples=`for i in 1 3 5 7 11 2 4 6 8  12; do echo -ne $refdir/$species$i".isoforms.results ";done`
    $rsem/rsem-generate-data-matrix $samples > $refdir/$species.matrix
    conditions="5,5"
fi

$rsem/rsem-generate-data-matrix `echo $samples | sed 's/isoforms/genes/g'` > $refdir/$species.genes.matrix
$rsem/rsem-run-ebseq $refdir/$species.genes.matrix $conditions $refdir/$species".genes.ebseq"
$rsem/rsem-control-fdr $refdir/$species".genes.ebseq" .05 $refdir/$species".genes.padj.ebseq"

$rsem/rsem-run-ebseq --ngvector ./ref/$species.ngvec $refdir/$species.matrix $conditions $refdir/$species".ebseq"
$rsem/rsem-control-fdr $refdir/$species".ebseq" .05 $refdir/$species".padj.ebseq"

kallisto.sh

#!/bin/bash
#SBATCH --job-name=kalisto
#SBATCH --partition=compute
#SBATCH --mem=3G
#SBATCH --time=5:00:00
#SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc


module load kallisto

##species=lf #am, bt, ln
##SLURM_ARRAY_TASK_ID=0
#for sbatch: --export=species=am
#10 for bt, 6 for am, 12 for ln, 15 lf
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz) 
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`

if [ "$species" == "bt" ]; then
    ref=./ref/GCF_000214255.1_Bter_1.0_rna
elif [ "$species" == "am" ]; then
    ref=./ref/GCF_000002195.4_Amel_4.5_rna
elif [ "$species" == "lf" ]; then
    ref=./ref/lf
elif [ "$species" == "ln" ]; then
    ref=./ref/ln
fi

kallisto quant -i $ref -o data/kallisto/$base -b 100 $f $r 

orthodb.py

# takes a list of protein sequences and determines their orthogroup ids
#with python/2.7.8  
#takes two files, source of fasta and output
import urllib, json, time, sys
from Bio import SeqIO
urlbase = "http://www.orthodb.org/blast?"
outfile = open(sys.argv[2], "w", 0)
level = "level=7434" #Aculeata
count = 1
start = time.time()
for rec in SeqIO.parse(sys.argv[1], "fasta"):       
    while True:             
        try:
            response = urllib.urlopen(urlbase + level  + "&seq=" + str(rec.seq).replace("*",""))
            data = response.read()
            parsedData = json.loads(data)
        except:
            print "http error"
            time.sleep(2)
        else:
            break
    if parsedData['data']:
        outfile.write(rec.id + "\t" + ",".join(parsedData['data']) + "\n")
    count += 1
    if count % 100 == 0:
        print("%i %.2f" % (count, time.time() - start))
        start = time.time()

rsem-generate-data-matrix

#!/usr/bin/perl

use strict;

if (scalar(@ARGV) == 0) {
    print "Usage: rsem-generate-data-matrix sampleA.[genes/isoforms].results sampleB.[genes/isoforms].results ... > output_name.matrix\n";
    print "Results files should be either all .genes.results or all .isoforms.results.\n";
    exit(-1);
}

my $offsite = 5; # ###### change to TPM

my $line;
my $n = scalar(@ARGV);
my $M = -1;
my @matrix = ();

# 0, file_name; 1, reference of expected count array; 2, reference of transcript_id/gene_id array
sub loadData {
    open(INPUT, $_[0]);
    my $line = <INPUT>; # The first line contains only column names
    while ($line = <INPUT>) {
    chomp($line); 
    my @fields = split(/\t/, $line);
    push(@{$_[2]}, "\"$fields[0]\"");
    push(@{$_[1]}, $fields[$offsite]);
    }
    close(INPUT);

    if (scalar(@{$_[1]}) == 0) {
    print STDERR "Nothing is detected! $_[0] may not exist or is empty.\n";
    exit(-1);
    }
}

#0, M; 1, reference of @ids_arr; 2, reference of @ids
sub check {
    my $size = $_[0];
    for (my $i = 0; $i < $size; $i++) { 
    if ($_[1]->[$i] ne $_[2]->[$i]) {
        return 0;
    }
    }
    return 1;
}

my @ids_arr = ();

for (my $i = 0; $i < $n; $i++) {
    my (@ids, @ecs) = ();
    &loadData($ARGV[$i], \@ecs, \@ids);

    if ($M < 0) { 
    $M = scalar(@ids); 
    @ids_arr = @ids;
    }
    elsif (!&check($M, \@ids_arr, \@ids)) { 
    print STDERR "Number of lines among samples are not equal!\n"; 
    exit(-1); 
    }

    my $colname;
    if (substr($ARGV[$i], 0, 2) eq "./") { $colname = substr($ARGV[$i], 2); }
    else { $colname = $ARGV[$i]; }
    $colname = "\"$colname\"";
    @ecs = ($colname, @ecs);
    push(@matrix, \@ecs);
}

@ids_arr = ("", @ids_arr);
@matrix = (\@ids_arr, @matrix);

for (my $i = 0; $i <= $M; $i++) {
    for (my $j = 0; $j < $n; $j++) { print "$matrix[$j][$i]\t"; }
    print "$matrix[$n][$i]\n";
}

rsem.sh

#!/bin/bash
#SBATCH --job-name=rsem
#SBATCH --partition=compute
#SBATCH --mem=4G
#SBATCH --time=2-00:00:00
#SBATCH --cpus-per-task=10
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc


module load rsem bowtie2

cd data/rsem
refdir=../raw_reads/trimmed
a=($refdir/*_1.fastq.gz) #43
b=($refdir/*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
species=`basename $f | cut -c-2`
base=`basename $f _1.fastq.gz`

if [ "$species" == "bt" ]; then
    ref=../../ref/bt
elif [ "$species" == "am" ]; then
    ref=../../ref/am
elif [ "$species" == "lf" ]; then
    ref=../../ref/lf
elif [ "$species" == "ln" ]; then
    ref=../../ref/ln
fi

echo rsem-calculate-expression  -p 8 --paired-end --bowtie2 $f $r $ref $base
/apps/unit/MikheyevU/sasha/RSEM/rsem-calculate-expression -p 10 --bowtie2  --paired-end $f $r $ref $base

tophat.sh

#!/bin/bash
#SBATCH --job-name=tophat
#SBATCH --partition=compute
##SBATCH --mem=1G
#SBATCH --time=1-00:00:00
##SBATCH --cpus-per-task=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc

#species=ln  #am, bt, ln
#for sbatch: --export=species=am
refdir=./data/raw_reads/trimmed
a=($refdir/$species*_1.fastq.gz) #10 for bt, 6 for am, 12 for ln
b=($refdir/$species*_2.fastq.gz)
f=${a["SLURM_ARRAY_TASK_ID"]}
r=${b["SLURM_ARRAY_TASK_ID"]}
base=`basename $f _1.fastq.gz`

module load bowtie2/2.2.6 tophat/2.1.1 cufflinks/2.2.1

if [ "$species" == "bt" ]; then
    gff=./ref/GCF_000214255.1_Bter_1.0_genomic.gff
    ref=./ref/bt
elif [ "$species" == "am" ]; then
    gff=./ref/GCF_000002195.4_Amel_4.5_genomic.gff
    ref=./ref/am
else
    gff=./ref/GCA_001045655.1_ASM104565v1_genomic.gff 
    ref=./ref/ln
fi

echo $SLURM_ARRAY_TASK_ID $species $gff $ref

echo  $species $ref  $base $gff $f $r 
#tophat2 -p 1 -G $gff -o ./data/assembly/tophat/$base $ref $f $r
cufflinks -p 1 -g $gff ./data/assembly/tophat/$base/accepted_hits.bam -o ./data/assembly/tophat/$base

transdecoder.sh

#!/bin/bash
#SBATCH --job-name=transdecoder
#SBATCH --partition=compute
#SBATCH --mem=10G
#SBATCH --cpus-per-task=10
#SBATCH --time=1-00:00:00
#SBATCH --ntasks=1
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err

##SLURM_ARRAY_TASK_ID=2

module load Trinity/2.1.1 

infile=./data/assembly/trinity_lf/Trinity.fasta

base=lf

TransDecoder -t $infile --reuse --workdir ./output_"$base" --search_pfam /apps/unit/MikheyevU/sasha/TransDecoder_r20140704/pfam/Pfam-AB.hmm.bin --CPU 10 -v

trimmomatic.sh

#!/bin/bash
#SBATCH --job-name=trim
#SBATCH --partition=compute
#SBATCH --mem=20G
#SBATCH --cpus-per-task=8
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err

trimmomatic=/apps/unit/MikheyevU/sasha/trimmomatic
##SLURM_ARRAY_TASK_ID=1
a=(*/*_1.fastq.gz) # 43
b=(*/*_2.fastq.gz) # 43
f=${a["SLURM_ARRAY_TASK_ID"]}   
r=${b["SLURM_ARRAY_TASK_ID"]}   
base=$(basename $f _1.fastq.gz)
java -jar $trimmomatic/trimmomatic-0.32.jar PE -threads 8 -phred33 $f $r trimmed/$base"_1.fastq.gz" trimmed/$base"_unpaired1.fastq.gz" trimmed/$base"_2.fastq.gz" trimmed/$base"_unpaired2.fastq.gz" ILLUMINACLIP:$trimmomatic/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:25

trinity.sh

#!/bin/bash
#SBATCH --job-name=trinity
#SBATCH --partition=compute
#SBATCH --mem=80G
#SBATCH --time=3-00:00:00
#SBATCH --cpus-per-task=12
##SBATCH --mail-user=%u@oist.jp
##SBATCH --mail-type=BEGIN,FAIL,END
#SBATCH --input=none
#SBATCH --output=%j.out
##SBATCH --error=job_%j.err
. $HOME/.bashrc

species=lf
refdir=../data/raw_reads/merged
module load Trinity/2.1.1  bowtie/1.1.0
Trinity --seqType fq --max_memory 75G --left $refdir/"$species"_1.fastq.gz \
 --right  $refdir/"$species"_2.fastq.gz  \
    --CPU 12 --output ../data/assembly/trinity_"$species"

Session information

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] workflowr_1.1.1   Rcpp_1.0.0        lattice_0.20-35  
 [4] digest_0.6.18     rprojroot_1.3-2   R.methodsS3_1.7.1
 [7] grid_3.5.1        jsonlite_1.6      backports_1.1.2  
[10] git2r_0.23.0      magrittr_1.5      evaluate_0.11    
[13] stringi_1.3.1     whisker_0.3-2     R.oo_1.22.0      
[16] R.utils_2.7.0     Matrix_1.2-14     reticulate_1.10  
[19] rmarkdown_1.10    tools_3.5.1       stringr_1.3.1    
[22] yaml_2.2.0        compiler_3.5.1    htmltools_0.3.6  
[25] knitr_1.20       

This reproducible R Markdown analysis was created with workflowr 1.1.1