4 RNA-seq data analysis

Many resources and introductory tutorials on typical RNA-seq workflows exist on the web. The choice of tools and computational methods is often a matter of personal preference. In general, R/Bioconductor is a great resource for tutorials on high-throughput sequencing (HTS) data.

Often the best approach is to reproduce locally the results presented in a tool's documentation. For example, most R/Bioconductor packages provide sample analyses and case studies as part of their vignettes.

4.1 General workflow

A typical RNA-seq data analysis workflow may consist of some or all of the following steps (tools that I have used in the past are given in brackets):

  1. Quality assessment of raw reads (fastqc)
  2. Adapter trimming and quality filtering (cutadapt, trimmomatic, bbduk.sh)
  3. Read alignment to reference genome (bowtie2, tophat, bwa-mem, STAR)
  4. Quality assessment of mapped reads (RSeQC, QoRTs, QualiMap)
  5. Read de-duplication (samtools, picard-tools)
  6. Count summarisation (HTSeq, featureCounts)
  7. Differential gene expression analysis (DESeq2, limma, edgeR)
  8. Gene ontology enrichment analysis/Gene set enrichment analysis (clusterProfiler, GSEA)
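
The first steps of the list above can be sketched as a sequence of command lines. The following is a minimal sketch, assuming single-end gzipped reads, a pre-built STAR index and an existing output directory; all file names, the function name `build_commands` and the quality cutoff of 20 are hypothetical choices made for illustration. The function only assembles the commands and does not run anything.

```python
from pathlib import Path


def build_commands(fastq, adapter, star_index, gtf, outdir="results", threads=4):
    """Return command lines (as argument lists) for a basic single-end workflow.

    All paths are placeholders supplied by the caller; nothing is executed.
    """
    out = Path(outdir)
    sample = Path(fastq).name.split(".")[0]
    trimmed = out / f"{sample}.trimmed.fastq.gz"
    prefix = f"{out / sample}."
    # STAR appends this suffix to the prefix for coordinate-sorted BAM output
    bam = f"{prefix}Aligned.sortedByCoord.out.bam"
    counts = out / f"{sample}.counts.txt"
    return [
        # Step 1: quality assessment of raw reads
        ["fastqc", "-o", str(out), str(fastq)],
        # Step 2: adapter trimming and 3' quality trimming
        ["cutadapt", "-a", adapter, "-q", "20", "-o", str(trimmed), str(fastq)],
        # Step 3: read alignment to the reference genome
        ["STAR", "--runThreadN", str(threads),
         "--genomeDir", str(star_index),
         "--readFilesIn", str(trimmed),
         "--readFilesCommand", "zcat",
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outFileNamePrefix", prefix],
        # Step 6: count summarisation per gene
        ["featureCounts", "-T", str(threads), "-a", str(gtf),
         "-o", str(counts), bam],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed. The QC, de-duplication and downstream steps are left out to keep the sketch short.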

Dündar et al. provide detailed notes from a course on RNA-seq data analysis at Cornell University. The notes include detailed instructions on how to install and run different tools from the command line and within R, and how to interpret results. They are aimed at biologists and clinical researchers with minimal knowledge of bioinformatics programming and high-throughput sequencing data analysis. Slides, sample files and the link to the GitHub project page can be found on the course website.


Conesa et al. have published a survey of best practices for RNA-seq data analysis that provides a nice overview of typical analysis steps and how to critically assess results.


Williams et al. have published a quantitative assessment of different RNA-seq analysis workflows using a large variety of read aligners and differential gene expression analysis tools.


Griffith lab RNA-seq tutorial

Griffith et al. provide an online Wiki-style RNA-seq analysis tutorial with exercises.


Daniel Cook - Awesome Bioinformatics

Daniel Cook maintains a list of bioinformatics tools and programs on GitHub.

4.2 Differential gene expression (DGE) analysis

I can specifically recommend the following three tutorials/vignettes on the R packages DESeq2 and limma. The statistical approaches underlying DESeq2 and limma overlap considerably, so whether to use one or the other is largely a matter of preference. Both are often regarded as gold-standard methods for DGE analysis.


http://www.bioconductor.org/help/workflows/rnaseqGene/

R/Bioconductor has a good tutorial on how to get from raw RNA-seq data to identifying differentially expressed genes.


DESeq2

The vignette is regularly updated and offers a great tutorial on how to perform differential gene expression analyses within R. The vignette also provides details on the underlying statistical modelling approach. Personally, I slightly favour the DESeq2 documentation and approach in my own research analyses.
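
As one concrete piece of that modelling approach, DESeq2 corrects for differences in sequencing depth with a per-sample size factor estimated by the median-of-ratios method. The following is a minimal pure-Python sketch of that idea, not the DESeq2 implementation; the function name `size_factors` and the input layout (one list of raw counts per gene) are assumptions made for illustration.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts: list of rows (genes), each a list of raw counts per sample.
    Genes with a zero count in any sample are skipped, because their
    geometric mean is zero and the ratio is undefined.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue
        # log of the geometric mean of this gene across samples
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has non-zero counts in every sample")
    # the median ratio per sample is robust to a few strongly changed genes
    return [math.exp(median(r)) for r in log_ratios]
```

Dividing each sample's raw counts by its size factor puts the samples on a comparable scale. In a real analysis the package computes and applies these factors internally.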


limma

The user guide offers an excellent tutorial using a wide range of different sample case studies on performing differential gene expression analyses within R.

4.3 Fusion gene identification

Various fusion/chimeric transcript identification tools exist.

Liu et al. performed a benchmark analysis of 15 fusion detection tools in their 2015 Nucleic Acids Research paper.


Another list of 39 (as of December 2016) fusion detection tools is given on Biostars.


InFusion

InFusion is one of the most recently published tools. It is written in C++ and Python, and comes with detailed instructions on how to compile the source code and run the program.

4.4 Read simulation

Simulating reads for typical high-throughput sequencing experiments is often useful for assessing and benchmarking novel computational methods. Different tools specialise in the simulation of reads from different library protocols (e.g. RNA-seq, whole-genome and exome sequencing) and different sequencing technologies (e.g. Illumina, SOLiD, 454, PacBio).
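
To illustrate the basic idea, the following toy sketch samples fixed-length reads uniformly from a reference sequence and introduces substitution errors at a fixed rate. It is a deliberately simplified illustration and not how any of the dedicated simulators work; the function name `simulate_reads`, the uniform sampling and the single `error_rate` parameter are all assumptions made for this example.

```python
import random

BASES = "ACGT"


def simulate_reads(reference, n_reads, read_length, error_rate=0.01, seed=0):
    """Sample fixed-length reads uniformly from `reference` with substitution errors.

    Returns a list of (start, read) tuples, where start is the 0-based
    position of the read in the reference.
    """
    if read_length > len(reference):
        raise ValueError("read_length exceeds reference length")
    rng = random.Random(seed)  # fixed seed makes the simulation reproducible
    reads = []
    for _ in range(n_reads):
        # randint bounds are inclusive, so the last valid start is included
        start = rng.randint(0, len(reference) - read_length)
        read = []
        for base in reference[start:start + read_length]:
            if rng.random() < error_rate:
                # replace with one of the other three bases
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        reads.append((start, "".join(read)))
    return reads
```

Because the true origin of every read is recorded, the output can be used to check whether a mapping or identification method recovers the known positions. Dedicated simulators additionally model protocol- and platform-specific properties.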


Escalona et al. provide a recent (2016) comparison of tools for the simulation of HTS data. This review is a good starting point for exploring different methods by following the references provided.


Flux Simulator is a read simulator for RNA-seq experiments. I have used Flux Simulator in the past, for example to simulate small RNA and miRNA reads in order to assess the performance of a computational miRNA identification method.