Biologist's bioinformatics notes

One of the most popular work flows in RNAseq data analysis is Tuxedo work flow 1 where aligner is Tophat, transcript assembler and quantifier is cufflinks and downstream analysis in R is cummeRbund. Later on same group released Tuxedo work flow 2 which has HISAT2 as aligner, StringTie as transcript assembler and quantifier and Ballgown is downstream analysis package in R.

There are many people who pursued alternative workflows. One of the most popular alternative work flows is Tophat-HTseq-DESeq2/edgeR/other packages in R. One such work flow involving Tophat-HTseq-DESeq2 is documented here in my blog. There are several changes in recent times esp in assembly based RNAseq data analysis. HISAT2 replaced Tophat and Featurecounts is faster than HTSeq. So let us construct a work flow that involves hisat2, featurecounts and DEseq2.

Working with HISAT2 is documented in here in my earlier blog. Using the work flow in the blog, work till "Bam file statistics" section. Once you reached that point, you will have 6 bam files (3 bam files for normal and 3 bam files for tumor). Link to example files and scripts are furnished in the same blog.

Now let us move on to featurecounts. Featurecounts outputs raw counts for each feature, provided in the work flow.

$ featureCounts -T 6 -p -t exon -g gene_id -a ~/reference/hg38/hg38_ucsc_refseq_chr22.gtf -s 2 --donotsort -o counts2.txt -M ../hisat2/*cut*.bam

Remember this work can be parallelized as each counts is independent of the others. However, featurecounts is wicked fast for small number of files. Hence I didn't parallelize it like other steps.

-T is threads (6 threads here)

-p is to say the bam is paired end bam and count fragments instead of reads

-t feature is exon i.e read count by exon

-g group features (exons here) by gene_id

-a reference gtf

-s strandedness. Here it is reverse stranded (2). This you would get it from library preperation.

-o output

-M since it is a RNAseq, a read can overlap with multiple transcripts. Hence we instruct the program to map read to multiple features.Summary of the output will be saved as counts2.txt.summary.

Some scholars would say not to count multimapping reads. But there are instances where you would need such a count. There are tools to count the mutlimapping reads in RNAseq (eg. mmquant: how to count multi-mapping reads?). User may choose this option as it suits the analysis.

Now that we got raw counts for each exon and grouped by gene, let us move onto differential expression. My assumption is that user has R and DESeq2 library is installed on the machine that would be used for analysis.

Below is the DESeq2 analysis. Please note that DESeq2 workflow is fairly well documented and i would cover only data import, meta data creation and making DESeq2 object. After that user can follow DESeq2 work flow document here. Follow from Differential expression analysis section onwards. This script partially overlaps with the bioconductor workflow so that user will find it easy to follow further.

=========================================================

## Load DESeq2, stringr and ggplot2 libraries

> library(DESeq2)

> library(stringr)

> library(ggplot2)

## Read the output from featurecounts. First set the working directory, then read the output from featurecounts (counts2.txt here) and save the object.

> setwd("~/example_data/practice_rnaseq_data/featurecounts/")

## Load the file and use GeneId column as row names

counts=read.csv("counts2.txt", sep="", head=T, skip=1, row.names = "Geneid")

## Print the names of columns from 6 to 11. Please note that the column names are like "...hisat2.hcc1395_normal_rep1.cutadapt.bam"

> colnames(counts)[6:11]

## Let us edit (rename) the column names to reflect sample names. This would be done by splitting column names (i.e from "...hisat2.hcc1395_normal_rep1.cutadapt.bam" to "hcc1395_normal_rep1"), using str_split_fixed function. Split the column names using dot (.) as delimiter, into 6 values and printing the the fifth value. Fifth value contains the sample name.

> colnames(counts)[6:11]=str_split_fixed(colnames(counts)[6:11],"\\.",6)[,5]

##Now make sure that column names are changed

> colnames(counts)[6:11]

## Create a sample data frame with sample names as row names and sample types as condition column

> samples = data.frame (row.names = colnames (counts) [6:11], condition = str_split_fixed (colnames(counts) [6:11],"_",3)[,2])

## Check if sample names created match with the column names of featurecounts. both should match ( TRUE)

> all (rownames(samples) %in% colnames(counts))

## create DESeq2 object

> dds=DESeqDataSetFromMatrix (countData = counts[,6:11], colData = samples, design = ~ condition)

## Let us say we want to see how the groups come together, we can use PCA plot on

## DESeq2 object. You would observe that PCA 1 it self separates data into two groups.

> plotPCA(rlog(dds), intgroup="condition") + theme_bw()

## DESeq function is a wrapper for standard differential expression analysis steps. Hence ## let us run the script on DESeq2 object created above (dds)

> ddds=DESeq(dds)

## Let us extract results and store it in a different object

> res_ddds=results(ddds)

## lfcShrink outputs shrunken log2 fold changes. One can provide either coefficient or the condition. Below example has both. User can take either one.

> lfcshrink_res_ddds <- lfcShrink(ddds, coef=2, res=res_ddds)

> lfcshrink_res_ddds1 <- lfcShrink(ddds, contrast=c("condition", "tumor","normal"), res=res_ddds)

## Sort the fold changes by adjusted p-value

> lfcshrink_res_ddds_ordered=lfcshrink_res_ddds[order(lfcshrink_res_ddds$padj),]

## Now plot the fold changes.

> plotMA(lfcshrink_res_ddds_ordered)

## We can plot counts of gene of lowest adjusted p-value i.e gene with highest statistical ## significance between the two groups.

> plotCounts(ddds, gene=which.min(res_ddds$padj), intgroup="condition", pch=2, col=ddds$condition, cex=3, transform = T)

=========================================================

User can follow further work flow from here.(link to bioconductor DESeq2 worklow).

Recent Posts

Links

Nov 24, 2017 - RNA seq analysis - FeatureCounts and DESeq2 workflow