There are several tutorials on generating heatmaps using microarray, Exon array and tools that generate differential data. In this note, let us explore simplest way to create heatmap with RNAseq data. Post DESeq2 analysis (RNASeq HISAT2-Featurecounts-DESeq2 work flow outlined here), you might have arrived at a list of genes with statistical significance with expression (fold change cutoff). For generating heatmap, do following:
- Extract genes (features of interest). From DESeq2 one would get Fold change, p-value, adjusted p-value etc. Now extract only genes
- Subset read count (normalized and available within DESeq2) for these genes only.
You can download the file from here: This data is DESeq2 normalized. There is another component needed i.e sample demography i.e grouping information such as age, condition, gender etc. Let us deal with very simple one i.e samples with one grouping information: Disease (Turmor and Normal). This meta information is available for these samples. Sample Names are in columns and genes are in rows. Column names of expression data ( with read counts) should match that from row names of demography data as the sample names should match between both the data (expression data and demography data). In R parlance, column names of expression data should match the row names of sample demography (meta data). In this example, you can observe the same.
condition
hcc1395_normal_rep1 normal
hcc1395_normal_rep2 normal
hcc1395_normal_rep3 normal
hcc1395_tumor_rep1 tumor
hcc1395_tumor_rep2 tumor
hcc1395_tumor_rep3 tumor
hcc1395_normal_rep1 normal
hcc1395_normal_rep2 normal
hcc1395_normal_rep3 normal
hcc1395_tumor_rep1 tumor
hcc1395_tumor_rep2 tumor
hcc1395_tumor_rep3 tumor
What we are going to do is to plot the expression values and group the samples by condition.
code (I am not showing how the data is loaded into R. It comes from either analysis with in R or from an external file):
===================================================
$ library(pheatmap)
$ pheatmap(as.matrix(log.rc.fc.sig.lfcshrink_res_ddds_ordered), scale = "row", clustering_distance_rows = "correlation", clustering_method = "complete",color = c("green", "red"),annotation_col = samples, labels_col = str_replace(rownames(samples),"hcc.[1-9]+_",""), main="Significant genes", fontsize_col=24, fontsize_row = 4)
================================================================
Explanation of the command:
- as.matrix - convert the data to matrix
- scale - scale the data by row
- clustering_distance_rows - now the samples must be clustered by correlation. Distance estimate can be changed for eg. by Euclidean distance
- Clustering_method - there are 2-3 methods. How clustering can be done: for eg complete, average, nearest etc. Here, it is complete
- Color - color gradient for expression values. Here I have used only two. You can use as many as you want. But the graph becomes too clumsy. Use intuition in color selection.
- annotation_col - this is very important. Now we want to column (samples) data to be clustered as per the condition. For this data we provide is sample meta data data frame.
- labels_col - column labels. We can supply any name that we wish to. But they should match with number of columns. In the above code, we are chopping of sample names so that they fit well in graph
- main - title of the graph
- fontsize_col - column font size (For sample names, I put large number, as they are small in number)
- fontsize_row - row font size (For gene names, I put small number as they are many in number)
Heatmap would like this: