Sometimes established tools drive us nuts. Here is an example. I have a VCF file that is not sorted. Download it from here (it is bgzipped; please decompress it before use). Now let us follow the standard procedure to index this unsorted file:
- bgzip the file
- Use tabix to index the file.
You will now get the following error:
========================================
$ tabix -p vcf ref.vcf.gz
[ti_index_core] the file out of order at line 59
==========================================
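To see what tabix is objecting to, here is a minimal sketch that scans the uncompressed VCF for the first record whose position is lower than the previous record on the same contig. It is not part of tabix, and the function name `first_unsorted_line` is made up for illustration. It assumes tab-separated records and only approximates tabix's own check, so the line it reports may not always match the one in the error message.

```shell
# first_unsorted_line FILE
# Print the line number of the first record whose POS (column 2) is lower
# than the POS of the previous record on the same contig (column 1).
# Prints nothing if no such record is found.
first_unsorted_line() {
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        $1 == prev_chrom && $2 + 0 < prev_pos { print NR; exit }
        { prev_chrom = $1; prev_pos = $2 + 0 }          # remember last record
    ' "$1"
}

# Example call on the decompressed file from this post:
# first_unsorted_line ref.vcf
```

An empty result means every contig's positions are in ascending order; it does not check whether each contig's records form one contiguous block.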
Tabix throws this error because the VCF is not sorted. The next logical step is to sort the VCF with the "sort" command of bcftools (I am using bcftools 1.10.2). Running it gives the following error:
===============================================================
$ bcftools sort ref.vcf.gz
Writing to /tmp/bcftools-sort.RNcNFP
[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
Error encountered while parsing the input at 1:207684192
Cleaning
==============================================================
Where should the user go now? Tabix fails because the VCF is not sorted, and bcftools fails to sort because the contig (chromosome) is not defined in the header. On top of failing, bcftools suggests a workaround, indexing with tabix, that itself fails because the file is not sorted.
Fortunately, vcftools comes to the rescue.
===========================
$ vcf-sort ref.vcf.gz
===========================
vcf-sort writes the sorted VCF to standard output, so redirect it to a file. Once that sorted file is bgzipped, tabix indexes it as intended.
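If vcftools is not at hand, the same idea can be sketched with standard Unix tools: keep the header lines in place and sort the records by contig and then by numeric position. This is an illustrative stand-in, not how vcf-sort is necessarily implemented. The function name `sort_vcf` and the output file name `ref.sorted.vcf.gz` used below are hypothetical. The sketch assumes an uncompressed, tab-separated VCF and orders contigs byte-wise (so "10" sorts before "2").

```shell
# sort_vcf FILE
# Write the header lines unchanged, then the records sorted by contig
# name (column 1) and numeric position (column 2), to standard output.
sort_vcf() {
    tab=$(printf '\t')
    grep '^#' "$1"                                      # header lines first
    grep -v '^#' "$1" | LC_ALL=C sort -t "$tab" -k1,1 -k2,2n
}

# Example call on the decompressed file from this post:
# sort_vcf ref.vcf > ref.sorted.vcf
```

The output can then be compressed and indexed in the usual way, for example with `sort_vcf ref.vcf | bgzip -c > ref.sorted.vcf.gz` followed by `tabix -p vcf ref.sorted.vcf.gz`.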