There are times where you would like to know the gene/transcript/exon for your SNP of interest. So you have SNP details in one data frame and gene and it's coordinates in another dataframe. For SNP, let us say you are have SNP ID and position (with/without other information) and for chromosome/gene/transcript/exon, you have coordinates (start and stop) and ID (chromosome number/gene symbol/transcript symbol/exon number).
Now problem is to join both the data frames so you can get chromosome/gene/transcript/exon name/ID and coordinates. Let me take an example here:
SNPs of interest:
SNV pos
1 SNP1 899983
2 SNP2 937283
3 SNP3 951771
4 SNP4 991102
5 SNP5 1034215
6 SNP6 1063818
Code to create this table/dataframe in R:
========================================
$ df1<-data.frame(SNV=paste0("SNP",seq(1,6)),
pos=c(899983,937283,951771,991102,1034215,1063818))
============================================
Reference genes and coordinates:
gene start stop
1 g1 355079 355336
2 g2 446820 462604
3 g3 700794 707341
4 g4 887159 1236478
5 g5 110971 112320
6 g6 156060 284753
Code to create this table/dataframe in R:
==========================================
df2<-data.frame(gene=c('g1','g2','g3','g4','g5','g6'),
start=c(355079,446820,700794,887159,110971,156060),
stop=c(355336,462604,707341,1236478,112320,284753))
==========================================
Now the problem we are trying to address to is where does my SNP lie and which gene has the variants and coordinates of the gene(s) that houses the variants.
This can be done in more than 2 ways in R. Let us solve this problem in two ways.
Code using fuzzyjoin library:
$ library(fuzzyjoin)
$ fuzzy_join(df1, df2, by = c("pos" = "start", "pos" = "stop"), match_fun = list(`>=`, `<=`))
output:
SNV pos gene start stop
1 SNP1 899983 g4 887159 1236478
2 SNP2 937283 g4 887159 1236478
3 SNP3 951771 g4 887159 1236478
4 SNP4 991102 g4 887159 1236478
5 SNP5 1034215 g4 887159 1236478
6 SNP6 1063818 g4 887159 1236478
It seems that all variants on g4 gene. Now what we did in above code was to join two dataframes and column "pos" from df1 on two columns of df2 ("start" and "stop"). We ask the code to look for variant position that more than or equal to start and less than or equal to stop i.e in between the start and stop of df2.
Let us do the same, with sqldf library:
$ library(sqldf)
$ sqldf("select * from df1 left join df2 on df1.pos between df2.start and df2.stop")
output:
SNV pos gene start stop
1 SNP1 899983 g4 887159 1236478
2 SNP2 937283 g4 887159 1236478
3 SNP3 951771 g4 887159 1236478
4 SNP4 991102 g4 887159 1236478
5 SNP5 1034215 g4 887159 1236478
6 SNP6 1063818 g4 887159 1236478