Sometimes, for some obscure reason, you are asked to generate substrings from a larger string, with restrictions on the substring length and on how much consecutive substrings overlap. For example, you may be asked to generate 3-letter strings from "Apple", moving forward one letter at a time. From "Apple" you would generate App, ppl, ple (each string shares two characters with the next). You may also want to put the result in a data frame for further manipulation. Here is an example with a random string of DNA letters (A, T, C, G).
print ("string is ATCGATCGGGTTTAC")
[1] "string is ATCGATCGGGTTTAC"
Now the requirement is to print 6-letter strings, each shifted one letter from the previous one (so consecutive strings overlap by five letters). There are many ways to do it. Let us start with a simple one.
library(stringr)
str_match_all("ATCGATCGGGTTTAC", "(?=(.{6}))")[[1]][,2]
[1] "ATCGAT" "TCGATC" "CGATCG" "GATCGG" "ATCGGG" "TCGGGT" "CGGGTT" "GGGTTT"
[9] "GGTTTA" "GTTTAC"
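The lookahead `(?=(.{6}))` matches at every position without consuming any characters, and its capture group picks up the next six letters, which is why the windows advance one letter at a time. The same sliding window can be sketched in base R without a regular expression. The helper name `sliding_windows` is made up for this illustration and is not part of stringr.

```r
# Hypothetical helper (not from stringr): every window of width k,
# advancing one character at a time, using only base R.
sliding_windows <- function(x, k) {
  n <- nchar(x)
  if (k > n) return(character(0))   # no window fits
  starts <- seq_len(n - k + 1)      # 1, 2, ..., last possible start
  substring(x, starts, starts + k - 1)
}

sliding_windows("ATCGATCGGGTTTAC", 6)
```

For the 15-letter example this gives ten windows, one for each start position from 1 to 10.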
This works, but it is only a stopgap. Now let us write a general function where the user provides a character vector (holding the long string), the required word length, and the number of characters that consecutive words should overlap.
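seq="ATCGATCGGGTTTAC"
extract_kmers=function(Seq,Len,Wind){
# Len = word length, Wind = number of characters shared by consecutive words
# start positions advance by Len-Wind; each word spans Len characters of Seq
kmers=str_sub(Seq,seq(1,nchar(Seq)-(Len-1),Len-Wind),seq(Len,nchar(Seq),Len-Wind))
# count each distinct word and return the counts as a data frame
return(as.data.frame(table(kmers, dnn = "String"), responseName="Counts"))
}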
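The function assumes that the overlap is smaller than the word length. If it is not, the step between start positions is zero or negative and `seq()` generally cannot build the positions. A minimal sketch of a guarded variant in base R follows. The name `extract_kmers_checked` and the error messages are assumptions made for this illustration.

```r
# Hypothetical wrapper around the same indexing idea, with argument checks.
# The function name and error messages are illustrative only.
extract_kmers_checked <- function(Seq, Len, Wind) {
  stopifnot(is.character(Seq), length(Seq) == 1)
  n <- nchar(Seq)
  if (Len < 1 || Len > n) stop("Len must be between 1 and nchar(Seq)")
  if (Wind < 0 || Wind >= Len) stop("Wind must be at least 0 and smaller than Len")
  step   <- Len - Wind                         # distance between window starts
  starts <- seq(1, n - Len + 1, by = step)     # first character of each window
  kmers  <- substring(Seq, starts, starts + Len - 1)
  as.data.frame(table(kmers, dnn = "String"), responseName = "Counts")
}

extract_kmers_checked("ATCGATCGGGTTTAC", 6, 2)
```

Stopping early with a clear message is a design choice. Returning an empty data frame would be another reasonable option.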
Now let us extract 2-letter words (dimers) with a one-character overlap
extract_kmers(seq,2,1)
String Counts
- AC 1
- AT 2
- CG 2
- GA 1
- GG 2
- GT 1
- TA 1
- TC 2
- TT 2
Now let us extract 6-letter words (hexamers) with a two-character overlap
extract_kmers(seq,6,2)
String Counts
- ATCGAT 1
- ATCGGG 1
- GGTTTA 1
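The three rows above follow directly from the start and end positions built with `seq()`. With a word length of 6 and an overlap of 2, each word starts 4 characters after the previous one. The arithmetic can be sketched as follows. The variable names are for this illustration only.

```r
# Illustration of the index arithmetic for word length 6 and overlap 2.
n    <- nchar("ATCGATCGGGTTTAC")   # 15 characters
Len  <- 6
Wind <- 2
step <- Len - Wind                 # each window starts 4 characters after the previous one

starts <- seq(1, n - (Len - 1), by = step)   # 1, 5, 9
ends   <- seq(Len, n, by = step)             # 6, 10, 14

data.frame(start = starts, end = ends)
```

Positions 1 to 6, 5 to 10 and 9 to 14 correspond to ATCGAT, ATCGGG and GGTTTA.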
Now let us extract 6-letter words (hexamers) with a five-character overlap
extract_kmers(seq,6,5)
String Counts
- ATCGAT 1
- ATCGGG 1
- CGATCG 1
- CGGGTT 1
- GATCGG 1
- GGGTTT 1
- GGTTTA 1
- GTTTAC 1
- TCGATC 1
- TCGGGT 1
Now let us extract 8-letter words (octamers) with a two-character overlap
extract_kmers(seq,8,2)
String Counts
- ATCGATCG 1
- CGGGTTTA 1
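Note that the two octamers cover positions 1 to 14 only, so the final C of the sequence does not appear in any word. When the word length and overlap do not tile the string exactly, the trailing characters are silently left out. A small sketch for finding that leftover follows. The helper `uncovered_tail` is hypothetical, and it assumes the word length is no larger than the string and the overlap is smaller than the word length.

```r
# Hypothetical helper (not part of the original function): returns the trailing
# characters that no window reaches for a given word length and overlap.
# Assumes Len <= nchar(Seq) and Wind < Len.
uncovered_tail <- function(Seq, Len, Wind) {
  n <- nchar(Seq)
  ends <- seq(Len, n, by = Len - Wind)   # last character of each window
  last_end <- max(ends)
  if (last_end < n) substr(Seq, last_end + 1, n) else ""
}

uncovered_tail("ATCGATCGGGTTTAC", 8, 2)   # leftover after the two octamers
uncovered_tail("ATCGATCGGGTTTAC", 6, 5)   # windows reach the end of the string
```

If the leftover matters for the analysis, one option is to pick a word length and overlap that tile the string exactly.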