Biologist's bioinformatics notes

Sometimes, for some obscure reason, you are asked to generate short strings from a larger string, with restrictions on the string size and on how much consecutive windows overlap. For example, you may be asked to generate 3-letter words from "Apple" with a window that slides one letter at a time. From Apple you would generate App, ppl, ple (each word shares two letters with the next). You may also want to put the result in a data frame for further manipulation. Here is an example with random letters of the DNA code (ATCG).
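The Apple example can be reproduced with a window that advances one letter at a time. This is a minimal sketch in base R; the names `word`, `k`, `starts`, `pieces` and `apple_df` are illustrative and not part of the original post.

```r
# Illustrative names only; base R, no packages needed.
word <- "Apple"
k <- 3                                   # length of each piece

# Start positions of every window that still fits inside the word: 1, 2, 3
starts <- 1:(nchar(word) - k + 1)

# substring() is vectorised over the start and end positions
pieces <- substring(word, starts, starts + k - 1)
print(pieces)

# Store the pieces in a data frame for further manipulation
apple_df <- data.frame(String = pieces)
```

Here `pieces` holds App, ppl and ple, one per window start.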

print ("string is ATCGATCGGGTTTAC")

[1] "string is ATCGATCGGGTTTAC"

Now the requirement is to print every 6-letter string, sliding the window one letter at a time (so neighbouring strings overlap by five letters). There are many ways to do it. Let us do it in a simple way.

library(stringr)
str_match_all("ATCGATCGGGTTTAC", "(?=(.{6}))")[[1]][,2]

 [1] "ATCGAT" "TCGATC" "CGATCG" "GATCGG" "ATCGGG" "TCGGGT" "CGGGTT" "GGGTTT"
 [9] "GGTTTA" "GTTTAC"
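The pattern `(?=(.{6}))` is a lookahead: it matches at every position that is followed by six characters without consuming them, so the next match can start one letter further along, and the capture group returns those six characters. Below is a hedged sketch of building the same kind of pattern for another word length. The names `dna`, `k`, `pattern` and `kmers` are illustrative, and `k = 4` is an arbitrary choice.

```r
library(stringr)

dna <- "ATCGATCGGGTTTAC"
k <- 4                                   # word length, chosen only for illustration

# Build the lookahead pattern for the chosen length, e.g. "(?=(.{4}))"
pattern <- sprintf("(?=(.{%d}))", k)

# Column 1 of the match matrix is the (empty) overall match;
# column 2 is the capture group holding the k letters
kmers <- str_match_all(dna, pattern)[[1]][, 2]
print(kmers)
```

This always slides the window by one letter; it does not let you choose a smaller overlap.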

This works, but it is only a temporary stopgap. Now let us make a general function, where the user provides a character vector (holding a long string), the required word length, and the overlap between consecutive words.

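seq="ATCGATCGGGTTTAC"
# Seq: the long string; Len: word length; Wind: letters shared by consecutive words (must be < Len)
extract_kmers=function(Seq,Len,Wind){
    # Window starts and ends both advance by Len-Wind
    df=str_sub(Seq,seq(1,nchar(Seq)-(Len-1),Len-Wind),seq(Len,nchar(Seq),Len-Wind))
    # Tabulate the words into a data frame with columns String and Counts
    return(as.data.frame(table(df, dnn = "String"), responseName="Counts"))
}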
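In the function above the window advances by `Len - Wind` letters each time, so the overlap has to be smaller than the word length, and any trailing letters that do not fill a whole window are dropped. A minimal sketch of a variant that checks its inputs first, using base R only; the name `extract_kmers_checked` and the specific checks are assumptions added here, not part of the original post.

```r
# Hypothetical variant with input checks; base R only.
extract_kmers_checked <- function(Seq, Len, Wind) {
  n <- nchar(Seq)
  # The step Len - Wind must be positive, and one full word must fit
  stopifnot(Len >= 1, Len <= n, Wind >= 0, Wind < Len)

  step <- Len - Wind
  starts <- seq(1, n - Len + 1, by = step)      # start of each window
  kmers <- substring(Seq, starts, starts + Len - 1)

  # Same output shape as before: one row per distinct word, with its count
  as.data.frame(table(kmers, dnn = "String"), responseName = "Counts")
}

res <- extract_kmers_checked("ATCGATCGGGTTTAC", 6, 2)
print(res)
```

With an overlap equal to the word length, for example `extract_kmers_checked("ATCG", 3, 3)`, the call stops with an error instead of passing a zero step to `seq()`.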

Now let us extract 2-letter words (dimers) with a one-character overlap

extract_kmers(seq,2,1)

String Counts

AC 1
AT 2
CG 2
GA 1
GG 2
GT 1
TA 1
TC 2
TT 2

Now let us extract 6-letter words (hexamers) with a two-character overlap

extract_kmers(seq,6,2)

String Counts

ATCGAT 1
ATCGGG 1
GGTTTA 1

Now let us extract 6-letter words (hexamers) with a five-character overlap

extract_kmers(seq,6,5)

String Counts

ATCGAT 1
ATCGGG 1
CGATCG 1
CGGGTT 1
GATCGG 1
GGGTTT 1
GGTTTA 1
GTTTAC 1
TCGATC 1
TCGGGT 1

Now let us extract 8-letter words (octamers) with a two-character overlap

extract_kmers(seq,8,2)

String Counts

ATCGATCG 1
CGGGTTTA 1

Oct 28, 2020 - Generate overlapping strings