In R, there are several ways to collapse columns (from wide format) to limited number of columns (to long format). There are certain cases where one needs to separate values (for eg. genes, transcripts, annotations etc) in a cell of a column and append it to appropriate gene. For eg. let us say a gene has 3 different GO terms. In table with two columns, one column has gene and another column has 3 GO terms in a single cell, but separated by , or ; or any other delimiter. Now we would like to transform this data in such a way that first column should contain gene name and in the second column, each row should have only one GO term as described below:
Before
Gene X | GO term 1, GO term2, GO term3
After
Gene X | GO term 1
Gene X | GO term 1
Gene X | GO term 1
Let me take an example with a post from biostars:
Gene | GO_terms
ENO | GO:0000015^GO:0000287^GO:0004634^GO:0006096
CCYL1 | GO:0000079
SAP30 | GO:0000118^GO:0003677^GO:0004407^GO:0046872^GO:0006351
Expected output is:
Gene GO_terms
1: ENO GO:0000015
2: ENO GO:0000287
3: ENO GO:0004634
4: ENO GO:0006096
5: CCYL1 GO:0000079
6: SAP30 GO:0000118
7: SAP30 GO:0003677
8: SAP30 GO:0004407
9: SAP30 GO:0046872
10: SAP30 GO:0006351
R code for this is:
========================
test=read.csv("test.txt", header=T, stringsAsFactors = F, sep="\t")
test$GO_terms=gsub('\\^',";", test$GO_terms)
test2=cSplit(test, "GO_terms", ";", "long")
===================
test.txt is original file. Substitution is necessary because csplit function doesn't consider ^ as separator with a reason. In general, ^ is used to denote start in regex. Hence convert ^ to a normal/accepted delimiter and then use csplit function. Had there been normal delimiter (such as , ; :) then delimiter substitution is not necessary. One can directly use csplit.