Biologist's bioinformatics notes

Let us say we have a text file and we need to annotate it with dbSNP vcf to get rsIDs. We will use an example text file as follows and down load the text file from here:

Now we have starting coordinate in 2nd column, end coordinate in 3rd column. Chromosome names in 1st column do not match the chromosome name convention followed by dbSNP. Instead of chromosome 1, dbSNP uses "1" as chromosome name. We will remove it while we run the script. In addition, bedtools expects the file to be in bed format. Bed format is zero based where as vcf is 1 based. One must be careful when matching such files together. Please note that reference file used in the note is an example file. You need to download dbSNP vcf files for appropriate genome builds.

To get only rsIDs, one needs some code writing and run it. Instead what we would do here is:

Convert text to bed
Use bedtools to intersect reference vcf (dbSNP vcf)
Get all the records from dbSNP and input file. Then we can filter in whatever way we want.

Above steps can be done in a single line (download the reference file from here):

================================================================

$ sed "s/^chr//g" test.txt | awk -v OFS="\t" '{$2=$2-1;print $1,$2,$3,$4}' | bedtools intersect -wb -wa -a stdin -b ref.sorted.vcf.gz

================================================================

output is as follows:

==================================================================
1   207679306   207679307   pan1   1   207679307   rs4844600   A   C,G   .   .   RS=4844600;RSPOS=207679307;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050328000b05110536000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;PM;PMC;S3D;SLO;NSM;REF;SYN;ASP;G5;HD;GNO;KGPhase1;KGPhase3;CAF=0.1496,.,0.8504;COMMON=1;TOPMED=0.16070177115188583,0.00227765035677879,0.83702057849133537
1   207684191   207684192   pan1   1   207684192   rs12037841   T   G   .   .   RS=12037841;RSPOS=207684192;dbSNPBuildID=120;SSR=0;SAO=0;VP=0x050100080005110136000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;SLO;INT;ASP;G5;GNO;KGPhase1;KGPhase3;CAF=0.06649,0.9335;COMMON=1;TOPMED=0.12155963302752293,0.87844036697247706
1   207685785   207685786   pan1   1   207685786   rs4266886   T   C   .   .   RS=4266886;RSPOS=207685786;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050100080005110536000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;SLO;INT;ASP;G5;HD;GNO;KGPhase1;KGPhase3;CAF=0.1426,0.8574;COMMON=1;TOPMED=0.16106810652395514,0.83893189347604485
1   207685964   207685965   pan1   1   207685965   rs4562624   A   C,T   .   .   RS=4562624;RSPOS=207685965;dbSNPBuildID=111;SSR=0;SAO=0;VP=0x050100080005110136000100;GENEINFO=CR1:1378;WGT=1;VC=SNV;SLO;INT;ASP;G5;GNO;KGPhase1;KGPhase3;CAF=0.06669,0.9333,.;COMMON=1;TOPMED=0.12107384046890927,0.87892615953109072,.
1   207692048   207692049   pan1   1   207692048   rs386638846   CATC   CGTT   .   .   RS=386638846;RSPOS=207692049;dbSNPBuildID=138;SSR=0;SAO=0;VP=0x050000080005000002000800;GENEINFO=CR1:1378;WGT=1;VC=MNV;INT;ASP
1   207692048   207692049   pan1   1   207692049   rs6656401   A   G,T   .   .   RS=6656401;RSPOS=207692049;dbSNPBuildID=116;SSR=0;SAO=0;VP=0x05016808000511053e030100;GENEINFO=CR1:1378;WGT=1;VC=SNV;PM;PMC;SLO;INT;ASP;G5;HD;GNO;KGPhase1;KGPhase3;MTP;OM;CAF=0.06669,0.9333,.;COMMON=1;TOPMED=0.09828937308868501,0.87834480122324159,0.02336582568807339

==================================================================

Output will have input columns (in bed format) and all the matching output from dbSNP VCF. Now we can extract information we want using cut, awk or any other column based extraction tool.

Recent Posts

Links

Aug 15, 2020 - Annotate text file with dbSNP reference