One of the requests I got was to parse BLAST results, extract the subject sequences from 3 HSPs, and write them out as a single FASTA record. This would be easy if the sequences were contiguous, but they are not, and the user wants the gaps between them filled with Ns. The steps are:
- Extract the sequences
- Calculate the difference between the current start coordinate and the previous end coordinate
- Fill the difference with Ns
- Concatenate all the sequences
- Remove the newlines
- Add a header
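The gap calculation is the only arithmetic in these steps, so here is a minimal sketch of it on its own. It assumes minus-strand subject coordinates (they decrease along the alignment), and `gap_len` is just an illustrative helper name, not part of any tool:

```shell
# gap_len PREV_END CUR_START
# Number of Ns needed between two segments on the minus strand,
# where PREV_END is the last coordinate of the previous line and
# CUR_START is the first coordinate of the current line.
gap_len() {
    awk -v prev_end="$1" -v cur_start="$2" \
        'BEGIN { print prev_end - cur_start - 1 }'
}

gap_len 116286 116285   # adjacent lines within one HSP -> 0
gap_len 115685 115566   # end of HSP 1 to start of HSP 2 -> 118
gap_len 114907 114885   # end of HSP 2 to start of HSP 3 -> 21
```

For a plus-strand subject the coordinates increase, so the subtraction would have to be flipped (current start minus previous end, minus 1).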
Following is the input data:
Score = 885 bits (479), Expect = 0.0
Identities = 613/676 (91%), Gaps = 16/676 (2%)
Strand=Plus/Minus
Query 1 ATGATAATTGATACGACAGAAGTACAAACTATCAATTCTTTTTCTATATTAGAATCCTTA 60
|||||||||||||||||||||||||||||||| |||||||||||| |||||||||||||
Sbjct 116345 ATGATAATTGATACGACAGAAGTACAAACTATTAATTCTTTTTCTGGATTAGAATCCTTA 116286
Query 61 AAAGAAGTCTATGGACTCATATGGATTTTTGTCCCCATTTTCACCCTTGTCTTAGGAATC 120
|||||||| |||||||||||||||||||||||||||||||| ||||||||||||||||||
Sbjct 116285 AAAGAAGTATATGGACTCATATGGATTTTTGTCCCCATTTTAACCCTTGTCTTAGGAATC 116226
Query 121 ACAATGGGGGTATTAGTAATTGTGTGGTTAGAAAGAAAAATATCCGCAGCAATACAACAA 180
||||||||| ||||||| |||||||| ||||||||||||||||| ||| ||| |||||||
Sbjct 116225 ACAATGGGG-TATTAGTCATTGTGTGATTAGAAAGAAAAATATCTGCAACAACACAACAA 116167
Query 181 CGTATTGGACCTGAATATGCCGGCCCATTAGGAATTCTTCAAGCTTTAGCGGATGGGACC 240
|||||||||||||||| |||||||||||||||||||||||||||||||||||||||||
Sbjct 116166 TGTATTGGACCTGAATAGGCCGGCCCATTAGGAATTCTTCAAGCTTTAGCGGATGGGACG 116107
Query 241 AAACTATTTTTGAAGGAGGATCTTCTTCCTTCTAGAGGGAATATTCGTTTGTTTAGCGTC 300
|||||| ||||||||||| || ||||||| |||||||| |||||||| || |||| |||
Sbjct 116106 AAACTACTTTTGAAGGAGAATATTCTTCCGTCTAGAGGTAATATTCGCTTATTTAAGGTC 116047
Query 301 GGACCTTCTATAGGGGTTATATCAATTCTACTAAGTTATTTAGTAATTCCTTTTGGATAT 360
||||| ||||||||| |||||||||| |||||||||| |||||||||||||
Sbjct 116046 GGACCCTCTATAGGGTTTATATCAATCCTACTAAGTT----------TCCTTTTGGATAT 115997
Query 361 CACCTTGTTTTAGCTGATCTCAGTATAGGTGtttttttATGGATTGCCATTTCAAGTATT 420
|||||||||||||||||| ||||||||||||||||||||||||||||| |||||||||||
Sbjct 115996 CACCTTGTTTTAGCTGATTTCAGTATAGGTGTTTTTTTATGGATTGCCTTTTCAAGTATT 115937
Query 421 GTCCCCATTGGTCTTCTTATGTCAGGATATGGATCAAATAATAAGTATTCCTTTTCAGGC 480
||||| ||||||||||||||||||||||||| ||||||||||||||||||||||||||||
Sbjct 115936 GTCCCTATTGGTCTTCTTATGTCAGGATATGAATCAAATAATAAGTATTCCTTTTCAGGC 115877
Query 481 GGTCTACGAGCTGCAGCTCAATCGATTAGTTATGAAATACCATTAACTCTATGTGTGTTA 540
||||||||||||| || ||||| ||||||||||||||||||||||||||||||||||||
Sbjct 115876 GGTCTACGAGCTGTAGATCAATAAATTAGTTATGAAATACCATTAACTCTATGTGTGTTA 115817
Query 541 GCAATATCTCTACGTGCGATTCGTTTGAACATGAACtttttttCTCTATTTTCTAGAAAA 600
|||||||||||||||| |||||||| ||||||||||| |||| ||| ||||||||||||
Sbjct 115816 GCAATATCTCTACGTGTGATTCGTTAGAACATGAACTCTTTT-CTC--TTTTCTAGAAAA 115760
Query 601 GAgaaaagaaatgaattgaaatttcaatacaatataaatagaattcaatatgtaaatatg 660
|||| |||||||||||| |||| ||||| |||| | ||| ||| |||||||||||||||
Sbjct 115759 GAGATAAGAAATGAATTTAAATA-CAATAAAATAGAGATATAATGCAATATGTAAATATG 115701
Query 661 aa-ataaaaaaaaaGA 675
|| || || ||||||
Sbjct 115700 AATATGAACGAAAAGA 115685
Score = 845 bits (457), Expect = 0.0
Identities = 603/670 (90%), Gaps = 24/670 (4%)
Strand=Plus/Minus
Query 678 ttttttATTCAACATTTCAGTTCGATGAGTTAAACCAGATAGTTATATGAGTGAAA-CAA 736
||||||||| || |||| ||||||||||||||||||||| |||||||||||||||| ||
Sbjct 115566 TTTTTTATTAAAAATTTTAGTTCGATGAGTTAAACCAGAGAGTTATATGAGTGAAAAAAA 115507
Query 737 AACTGCTCCTCAATTTGCAGTAAAACAAGAAAAATCTCATTCCCTAGGTACAAGAATGAA 796
|||||||||||||||||||||||||||| ||||||||||||||||||||||||||| |||
Sbjct 115506 AACTGCTCCTCAATTTGCAGTAAAACAATAAAAATCTCATTCCCTAGGTACAAGAA-GAA 115448
Query 797 A-TTGAAGTAAACATAAGTTGTTTACCCCAAGATTGAGATTCTTTGATTAGTCGTCATAT 855
| ||||||||||||||||||||||||||||| ||||||||| ||| |||||||||||||
Sbjct 115447 ATTTGAAGTAAACATAAGTTGTTTACCCCAATATTGAGATTATTTTCTTAGTCGTCATAT 115388
Query 856 CTTGAAGCGGATGCAAAAGATCAACTGTATTTATTACTATACTGGGGATCAATCAAAAAG 915
|||||||||||||||||||||| ||| |||||||||||||||||| ||||||||||||||
Sbjct 115387 CTTGAAGCGGATGCAAAAGATCCACTTTATTTATTACTATACTGGAGATCAATCAAAAAG 115328
Query 916 AAGTGGGTAGTTAGGAACACCAAAGTACACAAAGGATGAGTAATGGAAATAATGTAAGGT 975
||||| |||||||||||||||||||| |||||||||||||||| |||||||||||| |
Sbjct 115327 AAGTGAC-AGTTAGGAACACCAAAGTACGCAAAGGATGAGTAATGAAAATAATGTAAGAT 115269
Query 976 ATCaaa-a-aa-aGGG---GTT-TTTG--CATAAAACTTTGCATAAAACGAATCATAAT- 1025
|||||| | || | ||| ||| ||||||||||| ||||||||||||| ||||
Sbjct 115268 ATCAAAGATAACAAAAAAAGTTATTTTTTCATAAAACTTTCCATAAAACGAATCCTAATT 115209
Query 1026 AAGGGCTTGAAGTTGGTAGAAATGATCAAGCAGTACTTCCCCACGATTCCAATCTAGAGT 1085
||||||||| | |||||||||||||||||||||||||| ||||||||| | |||||||||
Sbjct 115208 AAGGGCTTGTAATTGGTAGAAATGATCAAGCAGTACTTTCCCACGATTACGATCTAGAGT 115149
Query 1086 ATGCTACTATTCGCTGATTAAAGAAATGACTATCAAGAACGAATTAATCCTTTATTTTAT 1145
|||||||||||||||||||||| ||||||||||||||||| ||| ||||||||||||| |
Sbjct 115148 ATGCTACTATTCGCTGATTAAATAAATGACTATCAAGAACAAATGAATCCTTTATTTTCT 115089
Query 1146 TTCCtttttttttttAGTTTTCagaaagaagaacaggaacaagacaaatagaatgcaata 1205
| ||||||||||||||||||||||| ||||||||||||||||||||||||||||||||
Sbjct 115088 TGACTTTTTTTTTTTAGTTTTCAGAAGAAAGAACAGGAACAAGACAAATAGAATGCAATA 115029
Query 1206 caataatagaataaaa--aagaataaaacgggaataataagaaaataTTTAGTTCTTCGT 1263
||||||| ||||||| |||||||||||||||||||||||||||||||| ||| | | |
Sbjct 115028 TAATAATATAATAAAATAAAGAATAAAACGGGAATAATAAGAAAATATTT-GTT-T-C-T 114973
Query 1264 TTCTTCATACATATGCATATGGGAATTCTTATCATGATTCATTAACTAATGCCCAATTCT 1323
| |||||||||||||||| |||| ||||||||||||||||||||||| ||||||||
Sbjct 114972 TA----ATACATATGCATATGGAAATTTTTATCATGATTCATTAACTAATGTCCAATTCT 114917
Query 1324 TTTTATTTAT 1333
||||||||||
Sbjct 114916 TTTTATTTAT 114907
Score = 721 bits (390), Expect = 0.0
Identities = 446/473 (94%), Gaps = 4/473 (1%)
Strand=Plus/Minus
Query 1737 GTGGCGTCAGCCCATAGGGTTTCTAGTTTTTCTAATGTCTTCTCTAGCAGAATGTGAAAG 1796
||||||||| |||||||||||||||||| |||||||||||||||||||||||||||| ||
Sbjct 114885 GTGGCGTCAACCCATAGGGTTTCTAGTTCTTCTAATGTCTTCTCTAGCAGAATGTGAGAG 114826
Query 1797 ATTACCCTTTGATTTACCGGAAGCAGAGGAGGAATTAGTAGCAGGTTATCAAACCGAATA 1856
|||||| ||| ||||||| |||| ||||||| |||||||||||||||||||||| ||||
Sbjct 114825 ATTACCTTTTAATTTACCAGAAGTAGAGGAGATATTAGTAGCAGGTTATCAAACCAAATA 114766
Query 1857 TTCAGGTATAAAATACGGGTTATTTTATCTTGCTTCTTACCTAAATCTATTAGTTTCTTC 1916
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 114765 TTCAGGTATAAAATACGGGTTATTTTATCTTGCTTCTTACCTAAATCTATTAGTTTCTTC 114706
Query 1917 ATT----ATTTGTAACAGTTCTTTACTTAGGTGGGTGGAATTTTTCTATTCCGTACATAT 1972
||| ||||||||||||||||||||||||| |||||||||| ||||||||||||||||
Sbjct 114705 ATTCATTATTTGTAACAGTTCTTTACTTAGGTAGGTGGAATTTCTCTATTCCGTACATAT 114646
Query 1973 CTATTACTGAACTTTTTGGAATAAATAAAATGTTTAGAGTCTTTGTAATAGCAATTGGTA 2032
|| ||||||||||||||||||||||||||||||| ||||||||||||||||||||| ||
Sbjct 114645 CTCTTACTGAACTTTTTGGAATAAATAAAATGTTGAGAGTCTTTGTAATAGCAATTAATA 114586
Query 2033 TCTTTATTACATTAGCTAAAGCTTATTTGTTTCTGTTCATTCCTATCACAACAAGATGGA 2092
||||||| |||||||||||||||||||||||||||||||||||||| |||||||||||||
Sbjct 114585 TCTTTATCACATTAGCTAAAGCTTATTTGTTTCTGTTCATTCCTATAACAACAAGATGGA 114526
Query 2093 CTTTACCTAGGATGAGAATGGATCAGTTATTAAATCTTGGATGGAAATTTCTTTTACCTA 2152
||||||||||||||||||||||||| ||||||||||||||||| ||||||| ||||||||
Sbjct 114525 CTTTACCTAGGATGAGAATGGATCAATTATTAAATCTTGGATGAAAATTTCCTTTACCTA 114466
Query 2153 TTTCTCTAGGTAATCTATTATTGACAACTTCTTCTCAACTTGTTTCACTATAA 2205
|||||||||||||||||||||| |||||||||| |||||||||||||||||||
Sbjct 114465 TTTCTCTAGGTAATCTATTATTAACAACTTCTTTTCAACTTGTTTCACTATAA 114413
There are 3 HSPs, and the user wants to combine the subject sequences. Note that the subject sequences are on the minus strand (so their coordinates decrease) and are not contiguous. Following is the solution:
$ awk '/Sbjct/ {print $2,$4,$3}' test.txt | awk '{print $1, $2, $3, (NR>1 ? ($1-p+1)*-1 : 0); p=$2}' | awk '{printf "%*s\n", length($3)+$4 ,$3 }' | awk '{gsub (" ","N")}1' | tr -d "\n" | awk 'NR==1 {print ">seq1"}1'
>seq1
ATGATAATTGATACGACAGAAGTACAAACTATTAATTCTTTTTCTGGATTAGAATCCTTAAAAGAAGTATATGGACTCATATGGATTTTTGTCCCCATTTTAACCCTTGTCTTAGGAATCACAATGGGG-TATTAGTCATTGTGTGATTAGAAAGAAAAATATCTGCAACAACACAACAATGTATTGGACCTGAATAGGCCGGCCCATTAGGAATTCTTCAAGCTTTAGCGGATGGGACGAAACTACTTTTGAAGGAGAATATTCTTCCGTCTAGAGGTAATATTCGCTTATTTAAGGTCGGACCCTCTATAGGGTTTATATCAATCCTACTAAGTT----------TCCTTTTGGATATCACCTTGTTTTAGCTGATTTCAGTATAGGTGTTTTTTTATGGATTGCCTTTTCAAGTATTGTCCCTATTGGTCTTCTTATGTCAGGATATGAATCAAATAATAAGTATTCCTTTTCAGGCGGTCTACGAGCTGTAGATCAATAAATTAGTTATGAAATACCATTAACTCTATGTGTGTTAGCAATATCTCTACGTGTGATTCGTTAGAACATGAACTCTTTT-CTC--TTTTCTAGAAAAGAGATAAGAAATGAATTTAAATA-CAATAAAATAGAGATATAATGCAATATGTAAATATGAATATGAACGAAAAGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTATTAAAAATTTTAGTTCGATGAGTTAAACCAGAGAGTTATATGAGTGAAAAAAAAACTGCTCCTCAATTTGCAGTAAAACAATAAAAATCTCATTCCCTAGGTACAAGAA-GAAATTTGAAGTAAACATAAGTTGTTTACCCCAATATTGAGATTATTTTCTTAGTCGTCATATCTTGAAGCGGATGCAAAAGATCCACTTTATTTATTACTATACTGGAGATCAATCAAAAAGAAGTGAC-AGTTAGGAACACCAAAGTACGCAAAGGATGAGTAATGAAAATAATGTAAGATATCAAAGATAACAAAAAAAGTTATTTTTTCATAAAACTTTCCATAAAACGAATCCTAATTAAGGGCTTGTAATTGGTAGAAATGATCAAGCAGTACTTTCCCACGATTACGATCTAGAGTATGCTACTATTCGCTGATTAAATAAATGACTATCAAGAACAAATGAATCCTTTATTTTCTTGACTTTTTTTTTTTAGTTTTCAGAAGAAAGAACAGGAACAAGACAAATAGAATGCAATATAATAATATAATAAAATAAAGAATAAAACGGGAATAATAAGAAAATATTT-GTT-T-C-TTA----ATACATATGCATATGGAAATTTTTATCATGATTCATTAACTAATGTCCAATTCTTTTTATTTATNNNNNNNNNNNNNNNNNNNNNGTGGCGTCAACCCATAGGGTTTCTAGTTCTTCTAATGTCTTCTCTAGCAGAATGTGAGAGATTACCTTTTAATTTACCAGAAGTAGAGGAGATATTAGTAGCAGGTTATCAAACCAAATATTCAGGTATAAAATACGGGTTATTTTATCTTGCTTCTTACCTAAATCTATTAGTTTCTTCATTCATTATTTGTAACAGTTCTTTACTTAGGTAGGTGGAATTTCTCTATTCCGTACATATCTCTTACTGAACTTTTTGGAATAAATAAAATGTTGAGAGTCTTTGTAATAGCAATTAATATCTTTATCACATTAGCTAAAGCTTATTTGTTTCTGTTCATTCCTATAACAACAAGATGGACTTTACCTAGGATGAGAATGGATCAATTATTAAATCTTGGATGAAAATTTCCTTTACCTATTTCTCTAGGTAATCTATTATTAACAACTTCTTTTCAACTTGTTTCACTATAA
What is happening here:
step 1. Print only the lines containing “Sbjct”, keeping columns 2 (start coordinate), 4 (end coordinate) and 3 (sequence), in that order
step 2. Print the start, end and sequence again, plus a fourth column: the number of bases missing between the previous line's end coordinate and the current line's start coordinate (0 for the first line and for adjacent lines)
step 3. Left-pad each sequence with that many spaces
step 4. Replace the spaces with N
step 5. Remove the newlines so that everything ends up on one line
step 6. Print the header as the first line, followed by the sequence from the previous step
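The same six steps can also be folded into a single awk program, which may be easier to read and modify. This is only a sketch under the same assumptions as the pipeline (minus-strand subject, fields laid out as label, start, sequence, end). The function name `stitch_sbjct` and the toy coordinates and sequences below are made up for illustration:

```shell
# Read BLAST alignment text on stdin and print one FASTA record.
# ">seq1" is the same placeholder header used in the pipeline above.
stitch_sbjct() {
    awk '
        /Sbjct/ {
            # Bases skipped since the previous Sbjct line (minus strand:
            # coordinates decrease, so previous end minus current start).
            gap = (seen ? prev_end - $2 - 1 : 0)
            for (i = 0; i < gap; i++) seq = seq "N"
            seq = seq $3        # append this line of sequence
            prev_end = $4       # remember where this line ended
            seen = 1
        }
        END { print ">seq1"; print seq }
    '
}

# Toy input: two adjacent lines, then a 3-base gap (93, 92, 91).
printf 'Sbjct 100 ACGT 97\nSbjct 96 TT-A 94\nSbjct 90 GGC 88\n' | stitch_sbjct
# prints:
# >seq1
# ACGTTT-ANNNGGC
```

As in the pipeline, the alignment gap characters (`-`) in the subject lines are carried through unchanged. If they are not wanted in the FASTA, they could be removed from the collected sequence before printing, for example with `gsub(/-/, "", seq)` in the `END` block.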