Suppose you have two files: one with peptides (short sequences) of interest and another with the expected peptides. An experimentally derived sequence may or may not match the calculated sequence. The differences can be as small as a single letter or cover almost the entire sequence, leaving only one or two bases/amino acids in common. Suppose the first file contains the following:
P1 SLVFLPFnT
P2 KLLLAtKSL
P3 sIWKHATPV
P4 KVTSIQhWV
P5 MtYDRYVAI
and a second file contains the following:
P1 KVTSIQAWV 2
P2 KVTSIQCWV 2.5
P3 KVTSIQDWV 4.5
P4 MTYDRVVAI 5
Now the user wants to extract all sequences in file 2 that match those in file 1,
allowing a difference of at most one amino acid. The output should contain the
peptides and values from the second file. Let us do it in R with a package called
"fuzzyjoin". The code is as follows (test1.txt = 1st file, test2.txt = 2nd file above):
=========================
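library(fuzzyjoin)
# Read both tab-separated files; there is no header row, so columns are named V1, V2, ...
df1 <- read.csv("test1.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE)
df2 <- read.csv("test2.txt", sep = "\t", stringsAsFactors = FALSE, header = FALSE)
# Keep pairs whose sequences (column V2) are within one edit of each other, ignoring case
df3 <- stringdist_inner_join(df1, df2, by = c("V2" = "V2"), max_dist = 1, ignore_case = TRUE)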
===========================
output:
=================
> df3
  V1.x      V2.x V1.y      V2.y  V3
1   P4 KVTSIQhWV   P1 KVTSIQAWV 2.0
2   P4 KVTSIQhWV   P2 KVTSIQCWV 2.5
3   P4 KVTSIQhWV   P3 KVTSIQDWV 4.5
4   P5 MtYDRYVAI   P4 MTYDRVVAI 5.0
===================
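To see why only P4 and P5 from the first file show up in the output, it can help to look at the pairwise distances directly. The sketch below is not how fuzzyjoin works internally. It swaps in base R's adist(), which computes a generalized Levenshtein distance, whereas fuzzyjoin delegates to the stringdist package. For a threshold of one edit on these example sequences, the two should give the same matches. The vector names file1 and file2 are only illustrative.

```r
# Sequences copied from the two example files above
file1 <- c(P1 = "SLVFLPFnT", P2 = "KLLLAtKSL", P3 = "sIWKHATPV",
           P4 = "KVTSIQhWV", P5 = "MtYDRYVAI")
file2 <- c(P1 = "KVTSIQAWV", P2 = "KVTSIQCWV", P3 = "KVTSIQDWV",
           P4 = "MTYDRVVAI")

# adist() (utils package, attached by default) returns edit distances;
# rows are file 1 sequences, columns are file 2 sequences
d <- adist(file1, file2, ignore.case = TRUE)
dimnames(d) <- list(names(file1), names(file2))
print(d)

# Row/column positions of the pairs that are at most one edit apart
hits <- which(d <= 1, arr.ind = TRUE)
print(hits)
```

Note that an edit distance of one also covers a single insertion or deletion. If only substitutions should count, passing method = "hamming" to stringdist_inner_join is likely the option to try, since fuzzyjoin forwards the method to stringdist.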
The output can now be customized by subsetting df3 to the required columns.
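As a minimal sketch of that subsetting step, the example below keeps only the peptide and value columns that came from the second file. So that it runs without fuzzyjoin or the input files, df3 is rebuilt by hand from the output shown above. The names result, peptide and value are illustrative and not part of the original data.

```r
# df3 rebuilt by hand from the join output shown above
df3 <- data.frame(
  V1.x = c("P4", "P4", "P4", "P5"),
  V2.x = c("KVTSIQhWV", "KVTSIQhWV", "KVTSIQhWV", "MtYDRYVAI"),
  V1.y = c("P1", "P2", "P3", "P4"),
  V2.y = c("KVTSIQAWV", "KVTSIQCWV", "KVTSIQDWV", "MTYDRVVAI"),
  V3   = c(2.0, 2.5, 4.5, 5.0),
  stringsAsFactors = FALSE
)

# Keep only the sequence and value columns from the second file (.y side of the join)
result <- df3[, c("V2.y", "V3")]

# Rename for readability (illustrative names, not from the original files)
names(result) <- c("peptide", "value")
print(result)
```

If the file 1 identifier is also needed, add "V1.x" to the column vector. The .x and .y suffixes mark columns from the first and second data frame respectively.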