Biologist's bioinformatics notes

In R, there are several ways to collapse columns (from wide format) to limited number of columns (to long format). There are certain cases where one needs to separate values (for eg. genes, transcripts, annotations etc) in a cell of a column and append it to appropriate gene. For eg. let us say a gene has 3 different GO terms. In table with two columns, one column has gene and another column has 3 GO terms in a single cell, but separated by , or ; or any other delimiter. Now we would like to transform this data in such a way that first column should contain gene name and in the second column, each row should have only one GO term as described below:

Before

Gene X | GO term 1, GO term2, GO term3

After

Gene X | GO term 1

Let me take an example with a post from biostars:

Gene | GO_terms

ENO | GO:0000015^GO:0000287^GO:0004634^GO:0006096

CCYL1 | GO:0000079

SAP30 | GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351

Expected output is:

Gene GO_terms

1: ENO GO:0000015

2: ENO GO:0000287

3: ENO GO:0004634

4: ENO GO:0006096

5: CCYL1 GO:0000079

6: SAP30 GO:0000118

7: SAP30 GO:0003677

8: SAP30 GO:0004407

9: SAP30 GO:0046872

10: SAP30 GO:0006351

R code for this is:

========================

test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")

test$GO_terms=gsub('\\^',";", test$GO_terms)

test2=cSplit(test, "GO_terms", ";", "long")

===================

test.txt is original file. Substitution is necessary because csplit function doesn't consider ^ as separator with a reason. In general, ^ is used to denote start in regex. Hence convert ^ to a normal/accepted delimiter and then use csplit function. Had there been normal delimiter (such as , ; :) then delimiter substitution is not necessary. One can directly use csplit.

Recent Posts

Links

Sep 13, 2017 - Split and melt columns in R