Some times brain acts like a child. Let us a give an example of conversation between me and my kid. Background is that I am trying to give her (10 yo) example of bird flying and principle behind aeroplanes.
me: Hi! do you know how birds fly?
kid: which one exactly?
me: any bird. sweetie!
kid: well, be specific. You ask me to be specific.
me: let us say crow
kid: You mean the crow that keeps cawing on outside tree
me: yes
kid: Do you know yesterday it was limping?
me: why? did you do any thing?
kid: no, i think some one hit it.
me: oh bad. One should not harm animals willingly. We should care for other humans and animals. (not realizing that she pulled me into her world)
Now my self and brain (Mr. B- Male brain):
me: Hi! I saw a post and user wants to get rsIDs
Mr.B: well, what format is the data?
me: VCF
Mr.B: How can you get rsIDs for vCard format?
me: It's not contact format. It variant call format.
Mr.B: oh okay.
me: Now I want a method. Give me now
Mr.B: sure, take this loop in R.
me: great. Thanks. I will work on it.
Mr.B: (after 5 min) You know what? I have apply method. No need of loop.
me: Why didn't you tell me earlier. Give me that.
Mr.B: (after 5 min) You don't even need R. You can use bedtools and intersect
me: oh boy. Give me that.
Mr.B: (after 5 min). Well, you don't need even that. You need awk, parallel and tabix. Simply extract it from NCBI servers.
me: (banging my head virtually), give me that. Why didn't you give me this first?
Mr.B: I am getting sleep. Moreover, you suffer from staircase wit.
Well, this is how I get side tracked and never be able to get answers straight. Let me get to the point. Some one requested to get rsIDs, given chromosome coordinates for genome build b37/hg19. I had to do three times almost. Here is the input file:
==================
chr1 207679307 207679307
chr1 207684192 207684192
chr1 207685786 207685786
chr1 207685965 207685965
chr1 207692049 207692049
====================
me: Hi! do you know how birds fly?
kid: which one exactly?
me: any bird. sweetie!
kid: well, be specific. You ask me to be specific.
me: let us say crow
kid: You mean the crow that keeps cawing on outside tree
me: yes
kid: Do you know yesterday it was limping?
me: why? did you do any thing?
kid: no, i think some one hit it.
me: oh bad. One should not harm animals willingly. We should care for other humans and animals. (not realizing that she pulled me into her world)
Now my self and brain (Mr. B- Male brain):
me: Hi! I saw a post and user wants to get rsIDs
Mr.B: well, what format is the data?
me: VCF
Mr.B: How can you get rsIDs for vCard format?
me: It's not contact format. It variant call format.
Mr.B: oh okay.
me: Now I want a method. Give me now
Mr.B: sure, take this loop in R.
me: great. Thanks. I will work on it.
Mr.B: (after 5 min) You know what? I have apply method. No need of loop.
me: Why didn't you tell me earlier. Give me that.
Mr.B: (after 5 min) You don't even need R. You can use bedtools and intersect
me: oh boy. Give me that.
Mr.B: (after 5 min). Well, you don't need even that. You need awk, parallel and tabix. Simply extract it from NCBI servers.
me: (banging my head virtually), give me that. Why didn't you give me this first?
Mr.B: I am getting sleep. Moreover, you suffer from staircase wit.
Well, this is how I get side tracked and never be able to get answers straight. Let me get to the point. Some one requested to get rsIDs, given chromosome coordinates for genome build b37/hg19. I had to do three times almost. Here is the input file:
==================
chr1 207679307 207679307
chr1 207684192 207684192
chr1 207685786 207685786
chr1 207685965 207685965
chr1 207692049 207692049
====================
Now user wants to get rsIDs, alleles. We can annotate using R, Shell (shell + system tools). Let us look at all the options. First let us do it in command line using awk, parallel and tabix
============================================
$ awk '{gsub("chr","",$1)} {print $1":"$2"-"$3}' coord.txt | parallel --delay 1 tabix ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz {} | awk -v OFS="\t" 'BEGIN {print "chromosome","Coordinate","rsID","Ref","Alt"} {print "chr"$1,$2,$3,$4,$5}'
output:
chr1 207684192 rs12037841 T G
chr1 207679307 rs4844600 A C,G
chr1 207685786 rs4266886 T C
chr1 207692048 rs386638846 CATC CGTT
chr1 207692049 rs6656401 A G,T
==============================================
chr1 207684192 rs12037841 T G
chr1 207679307 rs4844600 A C,G
chr1 207685786 rs4266886 T C
chr1 207692048 rs386638846 CATC CGTT
chr1 207692049 rs6656401 A G,T
==============================================
- First we will chop of string "chr" from the first column (using awk)
- Use parallel to send parallel requests to NCBI servers to get the information.
- get desired output by Awk operations as follows:
- Print header (using awk begin)
- Append string "chr" to the first column
- Print the columns using information from NCBI. Output from tabix is in general is VCF. So we take first 5 columns.
====================================================
Now let us do the same with biomart and R (with loop and without loop):
- First read file
- Create enviroment for biomart to download the annotations from
- Create empty data frame to fill later
- Use with and without loop methods for getting annotations.
====================================================
## Load the coordinates file
$ coords=read.csv("coord.txt", stringsAsFactors = F, strip.white = T, header = F, sep="\t")
## Chop off string "chr" from the first column.
$ coords$V1=gsub("chr","",coords$V1)
## Load library biomart
$ library(biomaRt)
## Create a mart and dataset from which we would like get information from
$ mart=useMart(biomart="ENSEMBL_MART_SNP", host= "grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")
$ mart=useMart(biomart="ENSEMBL_MART_SNP", host= "grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_snp")
## Create an empty data frame. Rows do not matter. Number of columns must ## be same number as attributes in getBM function.
$ results=data.frame(matrix(ncol=5,nrow=0))=
$ results=data.frame(matrix(ncol=5,nrow=0))=
=============================================================
Now get the annotations using a loop
===============================================================
## Now write a loop and append the output to dataframe (results) row wise (rbind)
$ for (i in seq (1,nrow(coords))){
ens=getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'), filters = c('chr_name','start','end'), values = as.list(coords[i,]), mart = mart)
results=rbind(results,ens)
}
ens=getBM(attributes = c('refsnp_id','allele','chrom_start','chrom_strand'), filters = c('chr_name','start','end'), values = as.list(coords[i,]), mart = mart)
results=rbind(results,ens)
}
=================================================================
Get the annotations using apply function, without using loop
================================================================
$ do.call(rbind,(apply(coords,1, function (x) getBM(attributes = c('refsnp_id', 'chrom_start','chrom_end', 'chrom_strand','allele'), filters = c('chr_name','start','end'), values = as.list(x), mart = mart))))
================================================================
Now one of the gentlemen on Biostars forum corrected the code that apply and loops are not necessary for annotation. I agree with him/her totally. Following is the better way in biomart:
=======================================================
## create a new dataframe with by pasting all the data into chrom:start:end format.
$ coords1=apply(coords,1, paste, collapse=":")
## Now run the getBM function
$ getBM(attributes = c('refsnp_id','chrom_start','chrom_end', 'chrom_strand', 'allele'), filters = "chromosomal_region", values = coords1, mart = mart)
========================================================If you have any queries, please post some example data here so that I can understand the problem. Please do not describe the data. For some reason, let us say you have file in text format, where in first 3 columns are coordinate information and rest of the columns are some other data, you can use this tutorial to annotate the data: https://digibio.blogspot.com/2020/08/annotate-text-file-with-dbsnp-reference.html