De novo assembly using Trinity
Trinity is one of the most popular software packages for efficient and robust de novo reconstruction of transcriptomes from RNA-Seq data. It consists of three software modules, Inchworm, Chrysalis and Butterfly, which run sequentially to process the sequencing reads.
Quote from Trinity GitHub:
- Inchworm assembles the RNA-seq data into the unique sequences of transcripts, often generating full-length transcripts for a dominant isoform, but then reports just the unique portions of alternatively spliced transcripts.
- Chrysalis clusters the Inchworm contigs into clusters and constructs complete de Bruijn graphs for each cluster. Each cluster represents the full transcriptional complexity for a given gene (or sets of genes that share sequences in common). Chrysalis then partitions the full read set among these disjoint graphs.
- Butterfly then processes the individual graphs in parallel, tracing the paths that reads and pairs of reads take within the graph, ultimately reporting full-length transcripts for alternatively spliced isoforms, and teasing apart transcripts that correspond to paralogous genes.
The Trinity developers have provided training materials, and the raw data and the software required are built into a VirtualBox image (Trinity2015.ova). I have saved a copy on ALPS1. The RNA-Seq data are 76 bp strand-specific Illumina RNA-Seq paired-end reads derived from Schizosaccharomyces pombe (fission yeast) grown under 4 conditions:
- logarithmic growth (Sp_log)
- plateau phase (Sp_plat)
- heat shock (Sp_hs)
- diauxic shift (Sp_ds)
* Due to the space limitations of gitbook, I will not provide the fq.gz files here. Please obtain these files from the VirtualBox image [Link]
-rw-rw-r-- 1 ycl6 ycl6 5790168 Oct 27 11:35 RNASEQ_data/Sp_ds.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6 5590326 Oct 27 11:35 RNASEQ_data/Sp_ds.right.fq.gz
-rw-rw-r-- 1 ycl6 ycl6 5815390 Oct 27 11:35 RNASEQ_data/Sp_hs.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6 5751383 Oct 27 11:36 RNASEQ_data/Sp_hs.right.fq.gz
-rw-rw-r-- 1 ycl6 ycl6 2154125 Oct 27 11:36 RNASEQ_data/Sp_log.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6 2097534 Oct 27 11:36 RNASEQ_data/Sp_log.right.fq.gz
-rw-rw-r-- 1 ycl6 ycl6 5488286 Oct 27 11:36 RNASEQ_data/Sp_plat.left.fq.gz
-rw-rw-r-- 1 ycl6 ycl6 5238362 Oct 27 11:36 RNASEQ_data/Sp_plat.right.fq.gz
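Because the reads are paired-end, each sample needs both a left and a right file. The sketch below is one way to confirm that before starting an assembly. The function name check_pairs is a hypothetical helper (it is not part of Trinity), and the only assumption about the data is the *.left.fq.gz / *.right.fq.gz naming shown in the listing above.

```shell
#!/usr/bin/env bash
# check_pairs is a hypothetical helper, not a Trinity command.
# It reports whether every *.left.fq.gz file in a directory has a
# matching *.right.fq.gz file, and returns non-zero if any is missing.
check_pairs() {
    local dir="$1" left right status=0
    for left in "$dir"/*.left.fq.gz; do
        # If the glob matched nothing, it is left as a literal pattern.
        [ -e "$left" ] || { echo "no left-read files in $dir"; return 1; }
        right="${left%.left.fq.gz}.right.fq.gz"
        if [ -e "$right" ]; then
            echo "OK       $(basename "$left" .left.fq.gz)"
        else
            echo "MISSING  $right"
            status=1
        fi
    done
    return "$status"
}

# Run the check only if the data directory from the listing is present.
if [ -d RNASEQ_data ]; then
    check_pairs RNASEQ_data
fi
```

With the eight files listed above in place, this should print one OK line for each of Sp_ds, Sp_hs, Sp_log and Sp_plat.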
Trinity
- v2.2.0 [17 Mar 2016] - Latest version available at the time of writing and used in this exercise
- v2.0.6 [13 Mar 2015] - Latest version available on ALPS1
Bowtie
- v1.1.2 [23 Jun 2015] - Latest version available at the time of writing and used in this exercise
- v1.0.1 [14 Mar 2014] - Latest version available on ALPS1
GMAP (Genomic Mapping and Alignment Program)
- v2016-09-23 - Latest version available at the time of writing and used in this exercise
STAR (Spliced Transcripts Alignment to a Reference)
- v2.5.2b [20 Aug 2016] - Latest version available at the time of writing and used in this exercise
- v2.3.0e [14 Feb 2013] - Latest version available on ALPS1
SAMtools
- v1.3.1 [22 Apr 2016] - Latest version available at the time of writing and used in this exercise
- v1.2 [02 Feb 2015] - Latest version available on ALPS1
RSEM (RNA-Seq by Expectation-Maximization)
- v1.3.0 [02 Oct 2016] - Latest version available at the time of writing
- v1.2.31 [04 Jun 2016] - Version used in this exercise
- v1.2.19 [05 Nov 2014] - Latest version available on ALPS1
Set JAVA_HOME and PATH
Bowtie 1 (NOT Bowtie 2) is required by the Chrysalis module.
* Below is an example showing how to set up the paths. Please remember to change them to match the locations of these binaries on your system.
cd ~/
# JAVA_HOME should be the JDK installation directory, not the java binary itself
export JAVA_HOME=/pkg/java/jdk1.7.0_51
export PATH=/pkg/java/jdk1.7.0_51/bin:/pkg/biology/Bowtie/bowtie-1.0.1:\
/work3/LSLNGS2015/Tools/RSEM-1.2.23:/pkg/biology/R/R-3.1.2/bin:\
/work3/LSLNGS2015/Tools/gmap-2015-09-29/bin:/pkg/biology/samtools/samtools-1.2:\
/work3/LSLNGS2015/Tools/STAR-STAR_2.4.2a/bin/Linux_x86_64_static:\
/pkg/biology/trinity/trinityrnaseq-2.0.6:$PATH
You can use echo $PATH to check the new PATH variable.
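Printing PATH shows the directories, but not whether each program is actually found in them. A minimal sketch of a more direct check is given below. The function name check_tools is a hypothetical helper, and the executable names passed to it (java, Trinity, bowtie, samtools, STAR, gmap, rsem-calculate-expression, R) are assumed to be the usual names installed by these packages; adjust them if your installation differs.

```shell
#!/usr/bin/env bash
# check_tools is a hypothetical helper, not part of any of the packages above.
# It reports where each named executable is found on PATH and returns
# non-zero if at least one of them cannot be found.
check_tools() {
    local tool status=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found    $tool -> $(command -v "$tool")"
        else
            echo "missing  $tool"
            status=1
        fi
    done
    return "$status"
}

# Executable names are assumptions; edit the list to match your installation.
check_tools java Trinity bowtie samtools STAR gmap rsem-calculate-expression R \
    || echo "One or more tools were not found; review the PATH setting above."
```

Any line starting with "missing" points to a directory that still needs to be added to, or corrected in, the export PATH command.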