Biologist's bioinformatics notes

VCF stands for Variant Call Format, i.e. a format defined to store variants. Although the format is meant to store variants, it is now also used to store information for every genomic position (i.e. all the genomic coordinates, not just the variant sites). In that case, it is called gVCF (g stands for genomic). The VCF file is a central point in the annotation step of NGS analysis.

How to index a VCF?

Several tools can index a VCF: IGVtools, tabix, etc. Indexing helps the user extract variants and their properties (in the INFO and FORMAT columns).

Indexing with IGVtools:

Requirements:

1) IGVtools (can be downloaded from the downloads section at http://www.broadinstitute.org/software/igv/log-in. This needs a one-time registration)
2) Any GNU/Linux distribution (rpm based, deb based, source based)
3) VCF file

Please note that IGVtools is a stickler for the VCF standard, which is described at http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41. If there is a non-standard entry, IGVtools refuses to index the file. For example, INFO field entries should not contain spaces: "Cardio vascular disease" is not accepted, whereas "cardiovasculardisease" is accepted. IGVtools throws an error in the former case.
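Since a space in the INFO column is enough to make indexing fail, it can help to scan the file first. Below is a minimal sketch using awk. The function name check_info_spaces and the tiny demo VCF are made up for illustration, and the check only covers spaces in column 8, not every rule of the VCF standard.

```shell
# Hypothetical helper: list data lines whose INFO column (column 8) contains a space.
# Returns a non-zero status if at least one such line is found.
check_info_spaces() {
    awk -F'\t' '!/^#/ && $8 ~ / / { printf "line %d: %s\n", NR, $8; bad = 1 }
                END { exit bad }' "$1"
}

# Tiny made-up VCF, for demonstration only.
demo=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo"
printf 'chr18\t100\t.\tA\tG\t50\tPASS\tDISEASE=Cardio vascular disease\n' >> "$demo"
printf 'chr18\t200\t.\tC\tT\t50\tPASS\tDISEASE=cardiovasculardisease\n' >> "$demo"

check_info_spaces "$demo" || echo "fix the lines listed above before indexing"
```

Only the first data line of the demo file is reported, because the second one has no space in its INFO entry.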

Once all the required files are in place, run the following command:

sh <path to igv tools>/igvtools index <input>.vcf

Example code:
sh /opt/igvtools_2.3.32/igvtools index gatk_output.vcf

This creates a new file with the .idx extension.
Using the above example, the user should see two files: gatk_output.vcf and gatk_output.vcf.idx (the .idx file is the index file).

This index file is mandatory to load a VCF file in IGV. Current versions of IGV prompt the user if the .idx file is missing. Note that IGV looks for the index file in the same folder as the VCF file.
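Because the index has to sit next to the VCF, a quick check before opening IGV can save a round trip. This is a minimal sketch; the function name igv_index_present is hypothetical and it only tests for a file named <input>.vcf.idx in the same folder.

```shell
# Hypothetical helper: succeed only if an IGVtools index sits next to the VCF.
# Usage: igv_index_present <input>.vcf
igv_index_present() {
    if [ -f "$1.idx" ]; then
        echo "found $1.idx"
    else
        echo "missing $1.idx (expected in $(dirname -- "$1"))" >&2
        return 1
    fi
}
```

Example: igv_index_present gatk_output.vcf || sh /opt/igvtools_2.3.32/igvtools index gatk_output.vcf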

Indexing with tabix:

Requirements:
1) Install tabix and bgzip (ubuntu/debian have these two tools in repositories)
2) VCF file

Indexing and extracting part of VCF:

1) First compress the VCF file using bgzip. Files compressed with plain gzip cannot be indexed by tabix.

command line: bgzip <input>.vcf
Example: bgzip gatk_output.vcf

bgzip creates a compressed VCF file with the .gz extension. In the above example, gatk_output.vcf.gz is created in the current folder and, by default, replaces the uncompressed gatk_output.vcf.
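A .gz extension alone does not tell whether a file came from bgzip or from plain gzip. The sketch below looks at the first 16 bytes of the file. It assumes the BGZF layout in which the gzip extra-field flag is set (bytes 1f 8b 08 04) and the "BC" subfield identifier sits at bytes 12-13. The function name is_bgzf is made up for illustration.

```shell
# Hypothetical helper: succeed if the file starts with a BGZF block header.
# Usage: is_bgzf <input.vcf.gz>
is_bgzf() {
    # First 16 bytes as one hex string, e.g. 1f8b0804...
    hex=$(od -An -tx1 -N16 "$1" | tr -d ' \n')
    case "$hex" in
        # bytes 0-3 = 1f 8b 08 04, bytes 4-11 = anything, bytes 12-13 = "BC"
        1f8b0804????????????????4243*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Example: is_bgzf gatk_output.vcf.gz && tabix -p vcf gatk_output.vcf.gz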

2) Index the zipped vcf file using tabix

command line: tabix -p vcf <input.vcf.gz>
Example: tabix -p vcf gatk_output.vcf.gz

The -p option denotes the input file type, in this case vcf. Tabix supports indexing of gff, bed, sam and vcf files.

After indexing, the user should see a new file with the extension .tbi in the current folder. In the above example, the user should see gatk_output.vcf.gz.tbi in the current folder.

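Alternatively, index the compressed VCF file using bcftools:
command line: bcftools index <input.vcf.gz>
Example: bcftools index gatk_output.vcf.gz
By default, bcftools index writes a .csi index; add the -t option to get a .tbi index like the one tabix produces.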

3) Extract region of interest:

Command line: tabix <input.vcf.gz> <region of interest>
Example: tabix gatk_output.vcf.gz chr18:1-100000

The output is in the original format, without the VCF headers. To get the VCF headers, use the following command:

Command line: tabix -h <input.vcf.gz> <region of interest>
Example: tabix -h gatk_output.vcf.gz chr18:1-100000

The user can redirect the output to another VCF file:

Example: tabix -h gatk_output.vcf.gz chr18:1-100000 > chr18.vcf
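The three steps (compress, index, extract) can be put together in one small function. This is only a sketch: the name extract_region is made up, and it uses bgzip -c so that the uncompressed VCF is kept instead of being replaced.

```shell
# Hypothetical wrapper around the three steps: compress, index, extract.
# Usage: extract_region <input.vcf> <region> <output.vcf>
extract_region() {
    vcf=$1 region=$2 out=$3
    # -c writes the compressed data to stdout, so the original VCF stays in place.
    bgzip -c "$vcf" > "$vcf.gz" || return 1
    # Build the .tbi index next to the compressed file.
    tabix -p vcf "$vcf.gz" || return 1
    # -h keeps the VCF header lines in the extracted file.
    tabix -h "$vcf.gz" "$region" > "$out"
}
```

Example: extract_region gatk_output.vcf chr18:1-100000 chr18.vcf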

To extract partial information from VCFs stored on public servers, e.g. the dbSNP VCFs stored at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/:

1) Create an index using tabix for the VCF file ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All.vcf.gz

command line: tabix -p vcf ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All.vcf.gz

This takes some time and creates an index file locally. In my general observation with NCBI servers, index files on remote servers often carry different time stamps from their data files, which causes extraction to fail. Tabix does not work if the time stamps are inconsistent, e.g. if the time stamp on the index file is earlier than that of the source gzip file.
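The time stamp problem can be checked by hand when both files are available locally. The sketch below assumes that the data file and its index have been downloaded to the same folder. The function name index_is_stale is hypothetical, and a missing index is treated as stale.

```shell
# Hypothetical helper: succeed if the index is missing or older than the data file.
# Usage: index_is_stale <data file> <index file>
index_is_stale() {
    [ ! -e "$2" ] || [ "$2" -ot "$1" ]
}
```

Example: index_is_stale All.vcf.gz All.vcf.gz.tbi && tabix -f -p vcf All.vcf.gz

Rebuilding the index with -f (overwrite the existing index) is the cautious choice here, since an older index may not match the data file any more.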

2) Extract region of interest

Command line: tabix -h <remote_server_vcf.gz> <region of interest>
Example: tabix -h ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/All.vcf.gz chr18:1-100000
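A region only returns records if its chromosome name is spelled the way the file spells it. Some files use chr18 and others use 18, and a mismatch gives empty output without an error. The sketch below uses tabix -l, which lists the sequence names in an indexed file, to pick the matching spelling. The function name resolve_contig is made up, and it only handles the presence or absence of the "chr" prefix.

```shell
# Hypothetical helper: print the spelling of a chromosome name that the file uses.
# Usage: resolve_contig <input.vcf.gz or URL> <name such as chr18>
resolve_contig() {
    # tabix -l prints one sequence name per line (the index must be available).
    names=$(tabix -l "$1") || return 1
    bare=${2#chr}
    for candidate in "$2" "$bare" "chr$bare"; do
        if printf '%s\n' "$names" | grep -qxF -- "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}
```

Example: tabix -h gatk_output.vcf.gz "$(resolve_contig gatk_output.vcf.gz chr18):1-100000"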
