Biologist's bioinformatics notes

There are times where you would like to know the gene/transcript/exon for your SNP of interest. So you have SNP details in one data frame and gene and it's coordinates in another dataframe. For SNP, let us say you are have SNP ID and position (with/without other information) and for chromosome/gene/transcript/exon, you have coordinates (start and stop) and ID (chromosome number/gene symbol/transcript symbol/exon number).

Now problem is to join both the data frames so you can get chromosome/gene/transcript/exon name/ID and coordinates. Let me take an example here:

SNPs of interest:

   SNV     pos
1   SNP1 899983
2   SNP2 937283
3   SNP3 951771
4   SNP4 991102
5   SNP5 1034215
6   SNP6 1063818

Code to create this table/dataframe in R:
========================================
$ df1<-data.frame(SNV=paste0("SNP",seq(1,6)),
    pos=c(899983,937283,951771,991102,1034215,1063818))
============================================

Reference genes and coordinates:

gene start    stop
1   g1 355079 355336
2   g2 446820 462604
3   g3 700794 707341
4   g4 887159 1236478
5   g5 110971 112320
6   g6 156060 284753

Code to create this table/dataframe in R:
==========================================
df2<-data.frame(gene=c('g1','g2','g3','g4','g5','g6'),
                start=c(355079,446820,700794,887159,110971,156060),
                stop=c(355336,462604,707341,1236478,112320,284753))
==========================================

Now the problem we are trying to address to is where does my SNP lie and which gene has the variants and coordinates of the gene(s) that houses the variants.

This can be done in more than 2 ways in R. Let us solve this problem in two ways.

Code using fuzzyjoin library:

$ library(fuzzyjoin)
$ fuzzy_join(df1, df2, by = c("pos" = "start", "pos" = "stop"), match_fun = list(`>=`, `<=`))

output:
   SNV     pos       gene start    stop
1 SNP1 899983    g4    887159 1236478
2 SNP2 937283    g4    887159 1236478
3 SNP3 951771    g4    887159 1236478
4 SNP4 991102    g4    887159 1236478
5 SNP5 1034215   g4    887159 1236478
6 SNP6 1063818   g4    887159 1236478

It seems that all variants on g4 gene. Now what we did in above code was to join two dataframes and column "pos" from df1 on two columns of df2 ("start" and "stop"). We ask the code to look for variant position that more than or equal to start and less than or equal to stop i.e in between the start and stop of df2.

Let us do the same, with sqldf library:

$ library(sqldf)
$ sqldf("select * from df1 left join df2 on df1.pos between df2.start and df2.stop")

output:
SNV     pos           gene start    stop
1 SNP1 899983    g4      887159 1236478
2 SNP2 937283    g4      887159 1236478
3 SNP3 951771    g4      887159 1236478
4 SNP4 991102    g4      887159 1236478
5 SNP5 1034215   g4      887159 1236478
6 SNP6 1063818   g4      887159 1236478

Recent Posts

Links

Nov 9, 2017 - Join dataframes by range in R