Sometimes you need to split columns in a data frame. If a column holds strings with delimiters between the parts, several packages can do this. Suppose instead that you have a data frame whose columns hold two-digit numbers and NA, and you want to split each value into its digits. Here is an example from the web:
Query:
      id rs143 rs148 rs149 rs1490
1 02003s    NA    11    22     11
2 02003s    NA    10    11     22
3 02003s    NA    11    11     12
4 02003s    NA    10    11     11
5 02003s    NA    10    11     11
Expected:
      id rs143 rs143.1 rs148 rs148.1 rs149 rs149.1 rs1490 rs1490.1
1 02003s    NA      NA     1       1     2       2      1        1
2 02003s    NA      NA     1       0     1       1      2        2
3 02003s    NA      NA     1       1     1       1      1        2
4 02003s    NA      NA     1       0     1       1      1        1
5 02003s    NA      NA     1       0     1       1      1        1
As the expected output shows, each column must be split in two, and the second column of each pair gets a .1 suffix on its name. There are several ways to address this problem. One of the highest-voted methods used the column names to decide what to split. I took a different approach: first check whether the column values are integers. If they are, split them into digits; if they are not (most probably character, or logical for a column that is entirely NA), duplicate the original values. Some people are particular about column names and some are not, so both solutions are provided below.
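Since the type check decides what happens to each column, it can help to look at the column classes first. The sketch below rebuilds the query data inline, as an assumed stand-in for test2.txt (which is not shown), and lists the class of every column. The names test, col_types and is_int are only illustrative.

```r
# Inline copy of the query data (an assumption standing in for test2.txt).
test <- data.frame(
  id     = rep("02003s", 5),
  rs143  = NA,                           # all-NA column, stored as logical
  rs148  = c(11L, 10L, 11L, 10L, 10L),
  rs149  = c(22L, 11L, 11L, 11L, 11L),
  rs1490 = c(11L, 22L, 12L, 11L, 11L),
  stringsAsFactors = FALSE
)

# Class of every column: only the integer columns will be split.
col_types <- sapply(test, class)
print(col_types)

# TRUE for the columns that is.integer() selects for splitting.
is_int <- sapply(test[-1], is.integer)
print(is_int)
```

Note that rs143 is logical rather than integer because it contains nothing but NA, so it falls into the "duplicate" branch.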
Solution1:
=======================
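library(stringr)
# Read the tab-delimited input file.
test <- read.csv("test2.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
# Split integer columns into their two digits; duplicate every other column.
# lapply() is used instead of sapply() so the result is always a list.
new_test <- cbind(test[1], as.data.frame(lapply(test[, -1], function(x) {
  if (is.integer(x)) {
    as.data.frame(str_split_fixed(as.character(x), "", 2))
  } else {
    replicate(2, x)
  }
})))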
===========================
Output:
===================
> new_test
      id rs143.1 rs143.2 rs148.V1 rs148.V2 rs149.V1 rs149.V2 rs1490.V1 rs1490.V2
1 02003s      NA      NA        1        1        2        2         1         1
2 02003s      NA      NA        1        0        1        1         2         2
3 02003s      NA      NA        1        1        1        1         1         2
4 02003s      NA      NA        1        0        1        1         1         1
5 02003s      NA      NA        1        0        1        1         1         1
==========================
Solution 2 (if the column names must match those requested by the original poster, rename them after splitting):
=========================
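# Read the tab-delimited input file.
test <- read.csv("test2.txt", stringsAsFactors = FALSE, header = TRUE, sep = "\t")
# Split integer columns with base R strsplit(); duplicate every other column.
df <- as.data.frame(cbind(test[1], lapply(test[, -1], function(x) {
  if (is.integer(x)) {
    do.call(rbind, strsplit(as.character(x), ""))
  } else {
    replicate(2, x)
  }
})))
# The new columns come out as name.1 and name.2; rename them to name and name.1.
# The patterns are anchored with $ so only the trailing suffix is changed.
names(df) <- sub("\\.1$", "", names(df))
names(df) <- sub("\\.2$", ".1", names(df))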
=======================
Output:
==================================
> df
      id rs143 rs143.1 rs148 rs148.1 rs149 rs149.1 rs1490 rs1490.1
1 02003s    NA      NA     1       1     2       2      1        1
2 02003s    NA      NA     1       0     1       1      2        2
3 02003s    NA      NA     1       1     1       1      1        2
4 02003s    NA      NA     1       0     1       1      1        1
5 02003s    NA      NA     1       0     1       1      1        1
===================================
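As an alternative sketch (not taken from the original answers), the requested names can be set while splitting, which avoids renaming afterwards. The helper split_digits is hypothetical. It assumes the first column is the id, that every integer value has exactly two digits, and that returning integers rather than character digits is acceptable. The data frame is built inline as an assumed stand-in for test2.txt.

```r
# Inline copy of the query data (an assumption standing in for test2.txt).
test <- data.frame(
  id     = rep("02003s", 5),
  rs143  = NA,
  rs148  = c(11L, 10L, 11L, 10L, 10L),
  rs149  = c(22L, 11L, 11L, 11L, 11L),
  rs1490 = c(11L, 22L, 12L, 11L, 11L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split two-digit integer columns into digits and
# duplicate all other columns, naming each pair name and name.1.
split_digits <- function(df) {
  parts <- lapply(names(df)[-1], function(nm) {
    x <- df[[nm]]
    if (is.integer(x)) {
      chars <- as.character(x)
      # First and second digit, converted back to integer.
      out <- data.frame(as.integer(substr(chars, 1, 1)),
                        as.integer(substr(chars, 2, 2)))
    } else {
      # Non-integer column (here the all-NA one): keep two copies.
      out <- data.frame(x, x, stringsAsFactors = FALSE)
    }
    names(out) <- c(nm, paste0(nm, ".1"))
    out
  })
  # Put the id column back in front of the split pairs.
  do.call(cbind, c(list(df[1]), parts))
}

res <- split_digits(test)
print(res)
```

Values with one digit or more than two would need a different splitting rule, so this is only a starting point for data shaped like the example.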