Biologist's bioinformatics notes

In variant calling, variants are stored in Variant Call Format (VCF). Call the file sample.vcf for the rest of this discussion. A sample VCF may contain one or more samples and may be annotated against a reference VCF (reference data such as the dbSNP, COSMIC, HGMD, or PharmGKB VCFs). Below are a few one-liners that make the job easier. Most of the commands are taken from the bcftools manual.

Annotate the sample VCF with a reference VCF, copying the ID column and the INFO column from the reference VCF.

$ bcftools annotate -c ID,INFO -a reference.vcf.gz sample.vcf.gz > output.vcf

Note: 1) The reference and sample files must be bgzipped (compressed with bgzip) and indexed with tabix.
2) bcftools must be installed.
3) The output is uncompressed. Its ID column holds the matching reference IDs and its INFO column holds the corresponding INFO fields from the reference VCF.

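Do the same as above, but write compressed output. Note that -O z writes BGZF (bgzip-compatible) compression rather than plain gzip, so the result can be indexed with tabix.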

$ bcftools annotate -O z -c ID,INFO -a reference.vcf.gz sample.vcf.gz > output.vcf.gz
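Since bcftools and tabix need BGZF rather than plain gzip, it can help to check which one a .gz file really is. The following is a minimal sketch, assuming the first block starts with the standard BGZF header (gzip magic bytes, the FEXTRA flag, and a "BC" extra subfield at bytes 12-13). The function name is_bgzf and the file name plain_demo.gz are made up for illustration.

```shell
# is_bgzf FILE: succeed if FILE starts with a BGZF block header.
# Only the first 14 bytes are inspected, so this is a quick check,
# not a full validation of the file.
is_bgzf() {
    sig=$(head -c 14 "$1" | od -An -tx1 | tr -d ' \n')
    case "$sig" in
        # 1f 8b 08 04 = gzip magic, deflate, FEXTRA set;
        # 42 43 = "BC" subfield identifier used by BGZF
        1f8b0804????????????????4243) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: a stream written by plain gzip is rejected.
printf 'demo\n' | gzip -c > plain_demo.gz
if is_bgzf plain_demo.gz; then
    echo "BGZF"
else
    echo "not BGZF"
fi
```

A file that fails this check can be decompressed and then recompressed with bgzip before indexing.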

Count the variant entries in a VCF and list them by type (SNV, MNV, etc.)

$ bcftools plugin counts sample.vcf.gz
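To see roughly what a count by type involves, here is a simplified sketch in awk. It is not how the bcftools plugin is implemented: it only compares REF and ALT lengths, and it does not handle symbolic alleles (such as <DEL>) or a missing ALT ("."). The file toy.vcf and the function count_types are hypothetical.

```shell
# toy.vcf is a made-up four-record file used only for illustration.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\trs1\tA\tG\t.\t.\t.\n'
    printf '1\t200\t.\tAT\tGC\t.\t.\t.\n'
    printf '2\t300\t.\tA\tAT\t.\t.\t.\n'
    printf '2\t400\trs2\tC\tT,CA\t.\t.\t.\n'
} > toy.vcf

# count_types FILE: classify every ALT allele by its length relative to REF.
count_types() {
    awk -F'\t' '
        /^#/ { next }                      # skip header lines
        {
            n = split($5, alt, ",")        # one entry per ALT allele
            for (i = 1; i <= n; i++) {
                r = length($4); a = length(alt[i])
                if (r == 1 && a == 1) type = "SNV"
                else if (r == a)      type = "MNV"
                else                  type = "indel"
                count[type]++
            }
        }
        END { for (t in count) print t, count[t] }
    ' "$1" | LC_ALL=C sort
}

count_types toy.vcf
```

Because multi-allelic records are counted once per ALT allele here, the totals can differ from a per-site count.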

Extract clinically significant variants from an annotated VCF (one annotated with dbSNP/ClinVar so that the CLNSIG tag is present). In the older numeric CLNSIG encoding, 5 stands for pathogenic.

$ bcftools view -i 'CLNSIG=="5"' annotated.vcf.gz

Same as above, but restricted to validated variants (the dbSNP VLD flag)

$ bcftools view -i 'CLNSIG=="5" & VLD==1' annotated.vcf.gz

Print unannotated variants (i.e. variants whose ID is .)

$ bcftools view -i 'ID="."' annotated.vcf.gz

Count the variants with no ID.

$ bcftools view -i 'ID="."' annotated.vcf.gz | bcftools plugin counts

Print annotated variants (i.e. variants with IDs)

$ bcftools view -k annotated.vcf.gz

or

$ bcftools view -i 'ID!="."' annotated.vcf.gz
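The ID-based filters above can be cross-checked without bcftools, because the ID is simply the third tab-separated column of each record. The sketch below counts both groups with awk. The file toy_annotated.vcf is a hypothetical example created inline.

```shell
# toy_annotated.vcf is a made-up four-record file used only for illustration.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '1\t100\trs1\tA\tG\t.\t.\t.\n'
    printf '1\t200\t.\tC\tT\t.\t.\t.\n'
    printf '2\t300\t.\tG\tA\t.\t.\t.\n'
    printf '2\t400\trs2\tT\tC\t.\t.\t.\n'
} > toy_annotated.vcf

# Skip header lines, then test column 3 (ID) against ".".
no_id=$(awk -F'\t' '!/^#/ && $3 == "." { n++ } END { print n + 0 }' toy_annotated.vcf)
with_id=$(awk -F'\t' '!/^#/ && $3 != "." { n++ } END { print n + 0 }' toy_annotated.vcf)

echo "without ID: $no_id, with ID: $with_id"
```

For a compressed file, pipe it in instead, for example gzip -dc annotated.vcf.gz | awk ... with the same awk program.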

Remove all IDs from the ID column of the VCF

$ bcftools annotate -x ID sample.vcf.gz

Add the prefix "chr" to the #CHROM column of a VCF

$ bcftools annotate --rename-chrs <text.file> sample.vcf.gz

Note: 1) The text file should have two columns: the old name in the first column and the new name in the second.

2) One chromosome per row.

3) The same can be achieved with sed. This changes only the record lines and leaves the ##contig header lines untouched:

$ sed 's/^[1-9XYM]/chr&/' sample.vcf > output.vcf
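The two-column file described in the note does not have to be typed by hand. A minimal sketch, assuming the chromosomes are named 1 to 22, X, Y and MT in the input, and using chr_map.txt as a hypothetical file name:

```shell
# Write "old<TAB>new" pairs, one chromosome per row.
# chr_map.txt is an illustrative name; adjust the list to your reference.
for c in $(seq 1 22) X Y MT; do
    printf '%s\tchr%s\n' "$c" "$c"
done > chr_map.txt

# Show the first few rows of the mapping.
head -n 3 chr_map.txt
```

The file can then be passed as bcftools annotate --rename-chrs chr_map.txt sample.vcf.gz. This sketch maps MT to chrMT. References that follow UCSC naming use chrM, so that row may need editing.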

Randomly sample VCF records. Sometimes you need a random subset of the variants in your VCF file. For this, give the fraction you want as a rate (e.g. 0.01 for 1% of your variants). vcfrandomsample is part of vcflib.

$ vcfrandomsample -r 0.01 input.vcf.gz > random.vcf

Note: This prints about 10 random variants in VCF format if your input file has 1000 variants. The rate is a per-record sampling probability, so the exact count varies from run to run.
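The idea of rate-based sampling can be sketched in awk: keep every header line, and keep each record with a probability equal to the rate. This is an illustration of the idea, not the vcflib implementation. The function sample_vcf, its optional seed argument, and the toy file names are all hypothetical.

```shell
# sample_vcf RATE FILE [SEED]: print all header lines, and each record
# with probability RATE. SEED defaults to 1 so runs are repeatable.
sample_vcf() {
    awk -v rate="$1" -v seed="${3:-1}" '
        BEGIN { srand(seed) }
        /^#/ { print; next }     # always keep the header
        rand() < rate            # keep this record with probability rate
    ' "$2"
}

# Build a toy file with 1000 records.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    awk 'BEGIN { for (i = 1; i <= 1000; i++) printf "1\t%d\t.\tA\tG\t.\t.\t.\n", i }'
} > toy_1000.vcf

sample_vcf 0.01 toy_1000.vcf > toy_random.vcf

# Count the sampled records: close to 10 on average, not exactly 10.
awk '!/^#/ { n++ } END { print n + 0 }' toy_random.vcf
```

Changing the seed gives a different subset, and the number of records returned will vary around the expected value.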

May 16, 2016 - VCF tricks