Biologist's bioinformatics notes

Sometimes one has to split columns in a data frame. If a column holds strings with a delimiter in between, there are several packages that can do that. But let us say you have a data frame with digits and NA, and you would like to split the digits into separate columns. Let me take an example from the web:
Query:
         id        rs143        rs148       rs149      rs1490
1    02003s         NA          11          22          11
2    02003s         NA          10          11          22
3    02003s         NA          11          11          12
4    02003s         NA          10          11          11
5    02003s         NA          10          11          11
Expected:
      id rs143 rs143.1 rs148 rs148.1 rs149 rs149.1 rs1490 rs1490.1
1 02003s    NA      NA     1       1     2       2      1        1
2 02003s    NA      NA     1       0     1       1      2        2
3 02003s    NA      NA     1       1     1       1      1        2
4 02003s    NA      NA     1       0     1       1      1        1
5 02003s    NA      NA     1       0     1       1      1        1

As you can see, each column must be split in two, and each new column must get a name derived from the original one. There are several ways to address this problem. One of the highest voted methods relied on the column names for splitting. I took a different approach: first check whether the column values are integers. If they are, split them into digits; if they are not (most probably character, or an all-NA column, which R reads as logical), duplicate the original values. Of course, some people are particular about column names and some do not care, so both solutions are provided below:
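
To see how the integer check behaves on this data, here is a minimal sketch in base R. It builds the example inline instead of reading test2.txt, and split_digits is a name made up for illustration only. It assumes every integer value has exactly two digits, as in the example above.

```r
# Inline copy of the example data (the post itself reads it from test2.txt)
test <- data.frame(
  id     = rep("02003s", 5),
  rs143  = rep(NA, 5),
  rs148  = c(11L, 10L, 11L, 10L, 10L),
  rs149  = c(22L, 11L, 11L, 11L, 11L),
  rs1490 = c(11L, 22L, 12L, 11L, 11L),
  stringsAsFactors = FALSE
)

# Which columns would be split? The all-NA column is logical, not integer.
is_int <- sapply(test[-1], is.integer)
print(is_int)

# Hypothetical helper: split two-digit integers into their digits,
# otherwise return the column duplicated side by side
split_digits <- function(x) {
  if (is.integer(x)) {
    do.call(rbind, strsplit(as.character(x), ""))
  } else {
    replicate(2, x)
  }
}

split_rs148 <- split_digits(test$rs148)  # 5 x 2 character matrix
split_rs143 <- split_digits(test$rs143)  # 5 x 2 logical matrix of NA
```

Note that the split digits come back as character values; wrap them in as.integer() if numbers are needed downstream.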

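Solution 1:
=======================
library(stringr)

# Read the tab-separated example; the all-NA column rs143 comes in as logical
test <- read.csv("test2.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")

# Split integer columns into two one-character columns; duplicate the rest
new_test <- cbind(test[1], as.data.frame(sapply(test[, -1], function(x) {
  if (is.integer(x)) {
    as.data.frame(str_split_fixed(as.character(x), "", 2))
  } else {
    replicate(2, x)
  }
})))
===========================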
Output:
===================
> new_test
      id rs143.1 rs143.2 rs148.V1 rs148.V2 rs149.V1 rs149.V2 rs1490.V1 rs1490.V2
1 02003s      NA      NA        1        1        2        2         1         1
2 02003s      NA      NA        1        0        1        1         2         2
3 02003s      NA      NA        1        1        1        1         1         2
4 02003s      NA      NA        1        0        1        1         1         1
5 02003s      NA      NA        1        0        1        1         1         1
==========================

Solution 2: now let us assume that the column names must be exactly as the OP requested in the post:
=========================
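test <- read.csv("test2.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")

# Same idea with base R only: strsplit() each value into its characters
df <- as.data.frame(cbind(test[1], lapply(test[, -1], function(x) {
  if (is.integer(x)) {
    do.call(rbind, strsplit(as.character(x), ""))
  } else {
    replicate(2, x)
  }
})))

# Rename each pair from name.1 / name.2 to name / name.1.
# The patterns are anchored with $ so only the trailing suffix is touched.
names(df) <- gsub("\\.1$", "", names(df))
names(df) <- gsub("\\.2$", ".1", names(df))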
=======================

Output:
==================================
> df
      id rs143 rs143.1 rs148 rs148.1 rs149 rs149.1 rs1490 rs1490.1
1 02003s    NA      NA     1       1     2       2      1        1
2 02003s    NA      NA     1       0     1       1      2        2
3 02003s    NA      NA     1       1     1       1      1        2
4 02003s    NA      NA     1       0     1       1      1        1
5 02003s    NA      NA     1       0     1       1      1        1
===================================
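
If you are particular about column names, another option is to set them while building each pair of columns, so no renaming with gsub is needed afterwards. The following is only a sketch under the same assumptions as the example (two-digit integers, data built inline rather than read from test2.txt); the variable names parts and df_named are made up for illustration.

```r
# Inline copy of the example data (the post itself reads it from test2.txt)
test <- data.frame(
  id     = rep("02003s", 5),
  rs143  = rep(NA, 5),
  rs148  = c(11L, 10L, 11L, 10L, 10L),
  rs149  = c(22L, 11L, 11L, 11L, 11L),
  rs1490 = c(11L, 22L, 12L, 11L, 11L),
  stringsAsFactors = FALSE
)

# Build one two-column data frame per original column, named name / name.1
parts <- lapply(names(test)[-1], function(nm) {
  x <- test[[nm]]
  m <- if (is.integer(x)) {
    do.call(rbind, strsplit(as.character(x), ""))  # split into digits
  } else {
    replicate(2, x)                                # duplicate as is
  }
  out <- as.data.frame(m, stringsAsFactors = FALSE)
  names(out) <- c(nm, paste0(nm, ".1"))
  out
})

# Put the id column back in front of the split columns
df_named <- cbind(test[1], do.call(cbind, parts))
print(names(df_named))
```

Because the names are derived from the original column name directly, this does not depend on the .1 / .2 suffixes that as.data.frame happens to generate.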

Apr 23, 2018 - Split columns in R data frame