Recently, I saw a post where user has a problem with mismatch between fasta headers (in reference file) and the gtf file. User wants all of them being changed to same kind: chr1, chr2, chr3 etc.
Example gtf file:
$ sed '/#/! s/.*_\(chr[0-9]\+\).*\(\tGL*\)/\1\2/g' test.gtf
However, the problem is that strings in second column vary. For eg it could be BGI or Genewise. (string and string lengths both vary). So I proposed an awk solution:
What this does it filters all the headers in gtf i.e gets lines that doesn't start with #. Then split the first column by delimiter "_" and stores in an array by name "a". Then second element in array is now made column one and print the output with tab as separator. If you split string "CA_chr4_BGI-A2_v1.0" with "_" as delimiter, second element catches chr string.
$ awk '/>/ {split($1,a,"_"); $1=">"a[2]}1' test.fa
Example gtf file:
##gff-version null
CA_chr4_BGI-A2_v1.0 GLEAN mRNA 123284514 123288477 0.999991- . ID=Cotton_A_18927_BGI-A2_v1.0;Name=Cotton_A_18927;source_id=CottonA_GLEAN_10022228;identical_support_id=CUFF67.1103.1;evid_id=Cot030308.1
CA_chr4_BGI-A2_v1.0 GLEAN CDS 123288376 123288477 . -0 Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN CDS 123287662 123287826 . -0 Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN CDS 123287427 123287536 . -0 Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN CDS 123287129 123287237 . -1 Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN CDS 123286939 123287051 . -0 Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN CDS 123286180 123286330 . -1 Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr4_BGI-A2_v1.0 GLEAN CDS 123284514 123285671 . -0 Parent=Cotton_A_18927_BGI-A2_v1.0
CA_chr9_BGI-A2_v1.0 GLEAN mRNA 17802711 17803334 1 +. ID=Cotton_A_16149_BGI-A2_v1.0;Name=Cotton_A_16149;source_id=CottonA_GLEAN_10030787;evid_id=Cot023903.1
Example fasta file$ cat test.fa
>gnl|BGIA2|CA_chr1
atgc
>gnl|BGIA2|CA_chr22
atgc
>gnl|BGIA2|CA_chr10
atgc
>gnl|BGIA2|CA_chr13
atgc
I first used sed to address this issue in the lines of
$ sed '/>/ s/.*_\(chr[0-9]\+\)/>\1/g' test.fa
In this code, I made chromosome names (chr[0-9]+) as pattern in sed and asked the command to print. I did similar way to extract chromosomes from gtf as well assuming that second column always contain string "GLEAN", in following lines:
However, the problem is that strings in second column vary. For eg it could be BGI or Genewise. (string and string lengths both vary). So I proposed an awk solution:
$ awk '!/#/ {split($1, a, "_"); $1=a[2]}1' OFS='\t' test.gtf
What this does it filters all the headers in gtf i.e gets lines that doesn't start with #. Then split the first column by delimiter "_" and stores in an array by name "a". Then second element in array is now made column one and print the output with tab as separator. If you split string "CA_chr4_BGI-A2_v1.0" with "_" as delimiter, second element catches chr string.
Now I did the same thing with fasta as well.
Now I am splitting the fasta header with "_" delimiter and taking 2nd element and append ">" before the element so that headers are restored.
Now both the files (gff and fasta) have same chromosome ids.