Let us say we have a text file and we need to annotate it with dbSNP vcf to get rsIDs. We will use an example text file as follows and down load the text file from here:
Now we have starting coordinate in 2nd column, end coordinate in 3rd column. Chromosome names in 1st column do not match the chromosome name convention followed by dbSNP. Instead of chromosome 1, dbSNP uses "1" as chromosome name. We will remove it while we run the script. In addition, bedtools expects the file to be in bed format. Bed format is zero based where as vcf is 1 based. One must be careful when matching such files together. Please note that reference file used in the note is an example file. You need to download dbSNP vcf files for appropriate genome builds.
To get only rsIDs, one needs some code writing and run it. Instead what we would do here is:
- Convert text to bed
- Use bedtools to intersect reference vcf (dbSNP vcf)
- Get all the records from dbSNP and input file. Then we can filter in whatever way we want.
Above steps can be done in a single line (download the reference file from here):
================================================================
$ sed "s/^chr//g" test.txt | awk -v OFS="\t" '{$2=$2-1;print $1,$2,$3,$4}' | bedtools intersect -wb -wa -a stdin -b ref.sorted.vcf.gz
================================================================
output is as follows:
==================================================================
1 207679306 207679307 pan1 1 207679307 rs4844600 A C,G . . RS=4844600;RSPOS=207679307;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050328000b05110536000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;PM;PMC;S3D;SLO;NSM;REF;SYN;ASP;G5;HD;GNO;KGPhase1;KGPhase3;CAF=0.1496,.,0.8504;COMMON=1;TOPMED=0.16070177115188583,0.00227765035677879,0.83702057849133537
1 207684191 207684192 pan1 1 207684192 rs12037841 T G . . RS=12037841;RSPOS=207684192;dbSNPBuildID=120;SSR=0;SAO=0;VP=0x050100080005110136000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;SLO;INT;ASP;G5;GNO;KGPhase1;KGPhase3;CAF=0.06649,0.9335;COMMON=1;TOPMED=0.12155963302752293,0.87844036697247706
1 207685785 207685786 pan1 1 207685786 rs4266886 T C . . RS=4266886;RSPOS=207685786;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050100080005110536000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;SLO;INT;ASP;G5;HD;GNO;KGPhase1;KGPhase3;CAF=0.1426,0.8574;COMMON=1;TOPMED=0.16106810652395514,0.83893189347604485
1 207685964 207685965 pan1 1 207685965 rs4562624 A C,T . . RS=4562624;RSPOS=207685965;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050100080005110136000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;SLO;INT;ASP;G5;GNO;KGPhase1;KGPhase3;CAF=0.06669,0.9333,.;COMMON=1;TOPMED=0.12107384046890927,0.87892615953109072,.
1 207692048 207692049 pan1 1 207692048 rs386638846 CATC CGTT . . RS=386638846;RSPOS=207692049;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000080005000002000800;GENEINFO=CR1:1378;WGT=1;VC=MNV;INT;ASP
1 207692048 207692049 pan1 1 207692049 rs6656401 A G,T . . RS=6656401;RSPOS=207692049;dbSNPBuildID=116;SSR=0;SAO=0;VP=0x05016808000511053e030100;GENEINFO=CR1:1378;WGT=1;VC=SNV;PM;PMC;SLO;INT;ASP;G5;HD;GNO;KGPhase1;KGPhase3;MTP;OM;CAF=0.06669,0.9333,.;COMMON=1;TOPMED=0.09828937308868501,0.87834480122324159,0.02336582568807339
==================================================================
Output will have input columns (in bed format) and all the matching output from dbSNP VCF. Now we can extract information we want using cut, awk or any other column based extraction tool.