Biologist's bioinformatics notes

Recently, I saw a post where user has a problem with mismatch between fasta headers (in reference file) and the gtf file. User wants all of them being changed to same kind: chr1, chr2, chr3 etc.
Example gtf file:

##gff-version null
CA_chr4_BGI-A2_v1.0     GLEAN   mRNA    123284514       123288477       0.999991-       .       ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123288376       123288477       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287662       123287826       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287427       123287536       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123287129       123287237       .      -1       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123286939       123287051       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123286180       123286330       .      -1       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0     GLEAN   CDS     123284514       123285671       .      -0       Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr9_BGI-A2_v1.0     GLEAN   mRNA    17802711        17803334        1      +.       ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1

Example fasta file

$ cat test.fa 
>gnl|BGIA2|CA_chr1
atgc
>gnl|BGIA2|CA_chr22
atgc
>gnl|BGIA2|CA_chr10
atgc
>gnl|BGIA2|CA_chr13
atgc

I first used sed to address this issue in the lines of

$ sed '/>/ s/.*_$chr[0-9]\+$/>\1/g' test.fa

In this code, I made chromosome names (chr[0-9]+) as pattern in sed and asked the command to print. I did similar way to extract chromosomes from gtf as well assuming that second column always contain string "GLEAN", in following lines:

$ sed '/#/! s/.*_$chr[0-9]\+$.*$\tGL*$/\1\2/g' test.gtf

However, the problem is that strings in second column vary. For eg it could be BGI or Genewise. (string and string lengths both vary). So I proposed an awk solution:

$ awk '!/#/ {split($1, a, "_"); $1=a[2]}1' OFS='\t' test.gtf

What this does it filters all the headers in gtf i.e gets lines that doesn't start with #. Then split the first column by delimiter "_" and stores in an array by name "a". Then second element in array is now made column one and print the output with tab as separator. If you split string "CA_chr4_BGI-A2_v1.0" with "_" as delimiter, second element catches chr string.

Now I did the same thing with fasta as well.

$ awk '/>/ {split($1,a,"_"); $1=">"a[2]}1' test.fa

Now I am splitting the fasta header with "_" delimiter and taking 2nd element and append ">" before the element so that headers are restored.

Now both the files (gff and fasta) have same chromosome ids.

Recent Posts

Links

May 9, 2018 - Manipulate fasta headers and GTF sequence IDs with awk