One of the requests I received was to parse BLAST results, extract the sequences from 3 HSPs, and write them out as FASTA. This would be easier if the sequences were contiguous, but they are not, and the user also wants the gaps between them filled with Ns. The steps are:

  1. Extract the sequences
  2. Calculate the difference between the current start coordinate and the previous end coordinate
  3. Fill the difference with Ns
  4. Concatenate all the sequences
  5. Remove the newlines
  6. Add header
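
The six steps above can also be sketched as one awk program. This is only an illustration under a few assumptions: each `Sbjct` line has the layout `Sbjct start sequence end`, the HSPs appear in the order they should be joined and do not overlap, and `hsps_to_fasta` is a hypothetical helper name. It uses the absolute coordinate difference, so it should work on either strand, and it builds one string in memory, so no separate newline-removal step is needed.

```shell
# hsps_to_fasta FILE [NAME]: hypothetical helper, not part of BLAST or awk.
hsps_to_fasta() {
  awk -v name="${2:-seq1}" '
    /Sbjct/ {
      start = $2; seq = $3; stop = $4
      if (seen) {
        # Bases skipped between the previous Sbjct line and this one.
        # Taking the absolute value handles both plus and minus strand.
        gap = start - prev_stop
        if (gap < 0) gap = -gap
        gap -= 1
        for (i = 0; i < gap; i++) out = out "N"
      }
      out = out seq        # concatenate, so no newlines are ever added
      prev_stop = stop
      seen = 1
    }
    END { if (seen) { print ">" name; print out } }
  ' "$1"
}
```

For example, `hsps_to_fasta test.txt seq1` would read the BLAST output from `test.txt` and print a two-line FASTA record.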

Following is the input data:


 Score = 885 bits (479),  Expect = 0.0
 Identities = 613/676 (91%), Gaps = 16/676 (2%)
 Strand=Plus/Minus



   Query  1       ATGATAATTGATACGACAGAAGTACAAACTATCAATTCTTTTTCTATATTAGAATCCTTA  60
                   |||||||||||||||||||||||||||||||| ||||||||||||  |||||||||||||
    Sbjct  116345  ATGATAATTGATACGACAGAAGTACAAACTATTAATTCTTTTTCTGGATTAGAATCCTTA  116286

    Query  61      AAAGAAGTCTATGGACTCATATGGATTTTTGTCCCCATTTTCACCCTTGTCTTAGGAATC  120
                   |||||||| |||||||||||||||||||||||||||||||| ||||||||||||||||||
    Sbjct  116285  AAAGAAGTATATGGACTCATATGGATTTTTGTCCCCATTTTAACCCTTGTCTTAGGAATC  116226

    Query  121     ACAATGGGGGTATTAGTAATTGTGTGGTTAGAAAGAAAAATATCCGCAGCAATACAACAA  180
                   ||||||||| ||||||| |||||||| ||||||||||||||||| ||| ||| |||||||
    Sbjct  116225  ACAATGGGG-TATTAGTCATTGTGTGATTAGAAAGAAAAATATCTGCAACAACACAACAA  116167

    Query  181     CGTATTGGACCTGAATATGCCGGCCCATTAGGAATTCTTCAAGCTTTAGCGGATGGGACC  240
                    |||||||||||||||| ||||||||||||||||||||||||||||||||||||||||| 
    Sbjct  116166  TGTATTGGACCTGAATAGGCCGGCCCATTAGGAATTCTTCAAGCTTTAGCGGATGGGACG  116107

    Query  241     AAACTATTTTTGAAGGAGGATCTTCTTCCTTCTAGAGGGAATATTCGTTTGTTTAGCGTC  300
                   |||||| ||||||||||| || ||||||| |||||||| |||||||| || ||||  |||
    Sbjct  116106  AAACTACTTTTGAAGGAGAATATTCTTCCGTCTAGAGGTAATATTCGCTTATTTAAGGTC  116047

    Query  301     GGACCTTCTATAGGGGTTATATCAATTCTACTAAGTTATTTAGTAATTCCTTTTGGATAT  360
                   ||||| ||||||||| |||||||||| ||||||||||          |||||||||||||
    Sbjct  116046  GGACCCTCTATAGGGTTTATATCAATCCTACTAAGTT----------TCCTTTTGGATAT  115997

    Query  361     CACCTTGTTTTAGCTGATCTCAGTATAGGTGtttttttATGGATTGCCATTTCAAGTATT  420
                   |||||||||||||||||| ||||||||||||||||||||||||||||| |||||||||||
    Sbjct  115996  CACCTTGTTTTAGCTGATTTCAGTATAGGTGTTTTTTTATGGATTGCCTTTTCAAGTATT  115937

    Query  421     GTCCCCATTGGTCTTCTTATGTCAGGATATGGATCAAATAATAAGTATTCCTTTTCAGGC  480
                   ||||| ||||||||||||||||||||||||| ||||||||||||||||||||||||||||
    Sbjct  115936  GTCCCTATTGGTCTTCTTATGTCAGGATATGAATCAAATAATAAGTATTCCTTTTCAGGC  115877

    Query  481     GGTCTACGAGCTGCAGCTCAATCGATTAGTTATGAAATACCATTAACTCTATGTGTGTTA  540
                   ||||||||||||| || |||||  ||||||||||||||||||||||||||||||||||||
    Sbjct  115876  GGTCTACGAGCTGTAGATCAATAAATTAGTTATGAAATACCATTAACTCTATGTGTGTTA  115817

    Query  541     GCAATATCTCTACGTGCGATTCGTTTGAACATGAACtttttttCTCTATTTTCTAGAAAA  600
                   |||||||||||||||| |||||||| ||||||||||| |||| |||  ||||||||||||
    Sbjct  115816  GCAATATCTCTACGTGTGATTCGTTAGAACATGAACTCTTTT-CTC--TTTTCTAGAAAA  115760

    Query  601     GAgaaaagaaatgaattgaaatttcaatacaatataaatagaattcaatatgtaaatatg  660
                   |||| |||||||||||| ||||  ||||| |||| | ||| ||| |||||||||||||||
    Sbjct  115759  GAGATAAGAAATGAATTTAAATA-CAATAAAATAGAGATATAATGCAATATGTAAATATG  115701

    Query  661     aa-ataaaaaaaaaGA  675
                   || || ||  ||||||
    Sbjct  115700  AATATGAACGAAAAGA  115685


     Score = 845 bits (457),  Expect = 0.0
     Identities = 603/670 (90%), Gaps = 24/670 (4%)
     Strand=Plus/Minus

    Query  678     ttttttATTCAACATTTCAGTTCGATGAGTTAAACCAGATAGTTATATGAGTGAAA-CAA  736
                   ||||||||| || |||| ||||||||||||||||||||| ||||||||||||||||  ||
    Sbjct  115566  TTTTTTATTAAAAATTTTAGTTCGATGAGTTAAACCAGAGAGTTATATGAGTGAAAAAAA  115507

    Query  737     AACTGCTCCTCAATTTGCAGTAAAACAAGAAAAATCTCATTCCCTAGGTACAAGAATGAA  796
                   |||||||||||||||||||||||||||| ||||||||||||||||||||||||||| |||
    Sbjct  115506  AACTGCTCCTCAATTTGCAGTAAAACAATAAAAATCTCATTCCCTAGGTACAAGAA-GAA  115448

    Query  797     A-TTGAAGTAAACATAAGTTGTTTACCCCAAGATTGAGATTCTTTGATTAGTCGTCATAT  855
                   | ||||||||||||||||||||||||||||| ||||||||| |||  |||||||||||||
    Sbjct  115447  ATTTGAAGTAAACATAAGTTGTTTACCCCAATATTGAGATTATTTTCTTAGTCGTCATAT  115388

    Query  856     CTTGAAGCGGATGCAAAAGATCAACTGTATTTATTACTATACTGGGGATCAATCAAAAAG  915
                   |||||||||||||||||||||| ||| |||||||||||||||||| ||||||||||||||
    Sbjct  115387  CTTGAAGCGGATGCAAAAGATCCACTTTATTTATTACTATACTGGAGATCAATCAAAAAG  115328

    Query  916     AAGTGGGTAGTTAGGAACACCAAAGTACACAAAGGATGAGTAATGGAAATAATGTAAGGT  975
                   |||||   |||||||||||||||||||| |||||||||||||||| |||||||||||| |
    Sbjct  115327  AAGTGAC-AGTTAGGAACACCAAAGTACGCAAAGGATGAGTAATGAAAATAATGTAAGAT  115269

    Query  976     ATCaaa-a-aa-aGGG---GTT-TTTG--CATAAAACTTTGCATAAAACGAATCATAAT-  1025
                   |||||| | || |      ||| |||   ||||||||||| ||||||||||||| |||| 
    Sbjct  115268  ATCAAAGATAACAAAAAAAGTTATTTTTTCATAAAACTTTCCATAAAACGAATCCTAATT  115209

    Query  1026    AAGGGCTTGAAGTTGGTAGAAATGATCAAGCAGTACTTCCCCACGATTCCAATCTAGAGT  1085
                   ||||||||| | |||||||||||||||||||||||||| ||||||||| | |||||||||
    Sbjct  115208  AAGGGCTTGTAATTGGTAGAAATGATCAAGCAGTACTTTCCCACGATTACGATCTAGAGT  115149

    Query  1086    ATGCTACTATTCGCTGATTAAAGAAATGACTATCAAGAACGAATTAATCCTTTATTTTAT  1145
                   |||||||||||||||||||||| ||||||||||||||||| ||| ||||||||||||| |
    Sbjct  115148  ATGCTACTATTCGCTGATTAAATAAATGACTATCAAGAACAAATGAATCCTTTATTTTCT  115089

    Query  1146    TTCCtttttttttttAGTTTTCagaaagaagaacaggaacaagacaaatagaatgcaata  1205
                   |  |||||||||||||||||||||||  ||||||||||||||||||||||||||||||||
    Sbjct  115088  TGACTTTTTTTTTTTAGTTTTCAGAAGAAAGAACAGGAACAAGACAAATAGAATGCAATA  115029

    Query  1206    caataatagaataaaa--aagaataaaacgggaataataagaaaataTTTAGTTCTTCGT  1263
                    ||||||| |||||||  |||||||||||||||||||||||||||||||| ||| | | |
    Sbjct  115028  TAATAATATAATAAAATAAAGAATAAAACGGGAATAATAAGAAAATATTT-GTT-T-C-T  114973

    Query  1264    TTCTTCATACATATGCATATGGGAATTCTTATCATGATTCATTAACTAATGCCCAATTCT  1323
                   |     |||||||||||||||| |||| ||||||||||||||||||||||| ||||||||
    Sbjct  114972  TA----ATACATATGCATATGGAAATTTTTATCATGATTCATTAACTAATGTCCAATTCT  114917

    Query  1324    TTTTATTTAT  1333
                   ||||||||||
    Sbjct  114916  TTTTATTTAT  114907


     Score = 721 bits (390),  Expect = 0.0
     Identities = 446/473 (94%), Gaps = 4/473 (1%)
     Strand=Plus/Minus

    Query  1737    GTGGCGTCAGCCCATAGGGTTTCTAGTTTTTCTAATGTCTTCTCTAGCAGAATGTGAAAG  1796
                   ||||||||| |||||||||||||||||| |||||||||||||||||||||||||||| ||
    Sbjct  114885  GTGGCGTCAACCCATAGGGTTTCTAGTTCTTCTAATGTCTTCTCTAGCAGAATGTGAGAG  114826

    Query  1797    ATTACCCTTTGATTTACCGGAAGCAGAGGAGGAATTAGTAGCAGGTTATCAAACCGAATA  1856
                   |||||| ||| ||||||| |||| |||||||  |||||||||||||||||||||| ||||
    Sbjct  114825  ATTACCTTTTAATTTACCAGAAGTAGAGGAGATATTAGTAGCAGGTTATCAAACCAAATA  114766

    Query  1857    TTCAGGTATAAAATACGGGTTATTTTATCTTGCTTCTTACCTAAATCTATTAGTTTCTTC  1916
                   ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
    Sbjct  114765  TTCAGGTATAAAATACGGGTTATTTTATCTTGCTTCTTACCTAAATCTATTAGTTTCTTC  114706

    Query  1917    ATT----ATTTGTAACAGTTCTTTACTTAGGTGGGTGGAATTTTTCTATTCCGTACATAT  1972
                   |||    ||||||||||||||||||||||||| |||||||||| ||||||||||||||||
    Sbjct  114705  ATTCATTATTTGTAACAGTTCTTTACTTAGGTAGGTGGAATTTCTCTATTCCGTACATAT  114646

    Query  1973    CTATTACTGAACTTTTTGGAATAAATAAAATGTTTAGAGTCTTTGTAATAGCAATTGGTA  2032
                   || ||||||||||||||||||||||||||||||| |||||||||||||||||||||  ||
    Sbjct  114645  CTCTTACTGAACTTTTTGGAATAAATAAAATGTTGAGAGTCTTTGTAATAGCAATTAATA  114586

    Query  2033    TCTTTATTACATTAGCTAAAGCTTATTTGTTTCTGTTCATTCCTATCACAACAAGATGGA  2092
                   ||||||| |||||||||||||||||||||||||||||||||||||| |||||||||||||
    Sbjct  114585  TCTTTATCACATTAGCTAAAGCTTATTTGTTTCTGTTCATTCCTATAACAACAAGATGGA  114526

    Query  2093    CTTTACCTAGGATGAGAATGGATCAGTTATTAAATCTTGGATGGAAATTTCTTTTACCTA  2152
                   ||||||||||||||||||||||||| ||||||||||||||||| ||||||| ||||||||
    Sbjct  114525  CTTTACCTAGGATGAGAATGGATCAATTATTAAATCTTGGATGAAAATTTCCTTTACCTA  114466

    Query  2153    TTTCTCTAGGTAATCTATTATTGACAACTTCTTCTCAACTTGTTTCACTATAA  2205
                   |||||||||||||||||||||| |||||||||| |||||||||||||||||||
    Sbjct  114465  TTTCTCTAGGTAATCTATTATTAACAACTTCTTTTCAACTTGTTTCACTATAA  114413

There are 3 HSPs, and the user wants to combine the subject sequences. Note, however, that the subject is on the minus strand (Strand=Plus/Minus), so its coordinates decrease, and the HSPs are not contiguous. Following is the solution:

$ awk '/Sbjct/ {print $2, $4, $3}' test.txt | awk '{print $1, $2, $3, (NR>1 ? ($1-p+1)*-1 : 0); p=$2}' | awk '{printf "%*s\n", length($3)+$4, $3}' | awk '{gsub(" ", "N")}1' | tr -d "\n" | awk 'NR==1 {print ">seq1"}1'

>seq1
ATGATAATTGATACGACAGAAGTACAAACTATTAATTCTTTTTCTGGATTAGAATCCTTAAAAGAAGTATATGGACTCATATGGATTTTTGTCCCCATTTTAACCCTTGTCTTAGGAATCACAATGGGG-TATTAGTCATTGTGTGATTAGAAAGAAAAATATCTGCAACAACACAACAATGTATTGGACCTGAATAGGCCGGCCCATTAGGAATTCTTCAAGCTTTAGCGGATGGGACGAAACTACTTTTGAAGGAGAATATTCTTCCGTCTAGAGGTAATATTCGCTTATTTAAGGTCGGACCCTCTATAGGGTTTATATCAATCCTACTAAGTT----------TCCTTTTGGATATCACCTTGTTTTAGCTGATTTCAGTATAGGTGTTTTTTTATGGATTGCCTTTTCAAGTATTGTCCCTATTGGTCTTCTTATGTCAGGATATGAATCAAATAATAAGTATTCCTTTTCAGGCGGTCTACGAGCTGTAGATCAATAAATTAGTTATGAAATACCATTAACTCTATGTGTGTTAGCAATATCTCTACGTGTGATTCGTTAGAACATGAACTCTTTT-CTC--TTTTCTAGAAAAGAGATAAGAAATGAATTTAAATA-CAATAAAATAGAGATATAATGCAATATGTAAATATGAATATGAACGAAAAGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTATTAAAAATTTTAGTTCGATGAGTTAAACCAGAGAGTTATATGAGTGAAAAAAAAACTGCTCCTCAATTTGCAGTAAAACAATAAAAATCTCATTCCCTAGGTACAAGAA-GAAATTTGAAGTAAACATAAGTTGTTTACCCCAATATTGAGATTATTTTCTTAGTCGTCATATCTTGAAGCGGATGCAAAAGATCCACTTTATTTATTACTATACTGGAGATCAATCAAAAAGAAGTGAC-AGTTAGGAACACCAAAGTACGCAAAGGATGAGTAATGAAAATAATGTAAGATATCAAAGATAACAAAAAAAGTTATTTTTTCATAAAACTTTCCATAAAACGAATCCTAATTAAGGGCTTGTAATTGGTAGAAATGATCAAGCAGTACTTTCCCACGATTACGATCTAGAGTATGCTACTATTCGCTGATTAAATAAATGACTATCAAGAACAAATGAATCCTTTATTTTCTTGACTTTTTTTTTTTAGTTTTCAGAAGAAAGAACAGGAACAAGACAAATAGAATGCAATATAATAATATAATAAAATAAAGAATAAAACGGGAATAATAAGAAAATATTT-GTT-T-C-TTA----ATACATATGCATATGGAAATTTTTATCATGATTCATTAACTAATGTCCAATTCTTTTTATTTATNNNNNNNNNNNNNNNNNNNNNGTGGCGTCAACCCATAGGGTTTCTAGTTCTTCTAATGTCTTCTCTAGCAGAATGTGAGAGATTACCTTTTAATTTACCAGAAGTAGAGGAGATATTAGTAGCAGGTTATCAAACCAAATATTCAGGTATAAAATACGGGTTATTTTATCTTGCTTCTTACCTAAATCTATTAGTTTCTTCATTCATTATTTGTAACAGTTCTTTACTTAGGTAGGTGGAATTTCTCTATTCCGTACATATCTCTTACTGAACTTTTTGGAATAAATAAAATGTTGAGAGTCTTTGTAATAGCAATTAATATCTTTATCACATTAGCTAAAGCTTATTTGTTTCTGTTCATTCCTATAACAACAAGATGGACTTTACCTAGGATGAGAATGGATCAATTATTAAATCTTGGATGAAAATTTCCTTTACCTATTTCTCTAGGTAATCTATTATTAACAACTTCTTTTCAACTTGTTTCACTATAA
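
One way to sanity-check the result is to measure the runs of N. From the coordinates in the input, the gap between the first and second HSP is 115685 - 115566 - 1 = 118 bases, and between the second and third it is 114907 - 114885 - 1 = 21 bases, so two runs of those lengths are expected. The sketch below assumes the subject sequences themselves contain no N; `n_runs` is a hypothetical helper name.

```shell
# Print the length of every run of consecutive Ns in FASTA read from stdin,
# one length per line. Header lines (starting with ">") are skipped.
n_runs() {
  awk '!/^>/ {
    s = $0
    while (match(s, /N+/)) {
      print RLENGTH                    # length of the run just found
      s = substr(s, RSTART + RLENGTH)  # continue after that run
    }
  }'
}

# Toy record: prints 3, then 2
printf '>toy\nACNNNGTNNA\n' | n_runs
```

Piping the output of the one-liner into `n_runs` should then list the two gap sizes.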

What is happening here:

step 1. Print lines containing “Sbjct”, keeping only columns 2 (start coordinate), 4 (end coordinate) and 3 (sequence), in that order

step 2. Print the start, stop and sequence columns, followed by the difference between the current line's start and the previous line's stop (0 for the first line)

step 3. Right-align each sequence in a field whose width is the sequence length plus the difference from step 2, so the gap shows up as leading spaces

step 4. Replace the spaces with N

step 5. Remove the newlines so that all the sequences end up on a single line

step 6. Print the header as the first line, then print everything from the previous step
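
To see what steps 2 to 4 do, it can help to run them on a tiny input. The sketch below uses three made-up minus-strand lines in the `start stop sequence` layout produced by step 1; the coordinates, the sequences and the function names `toy`, `step2` and `step34` are all invented for illustration.

```shell
# Three toy lines: start, stop, sequence (made-up data, minus strand).
toy() { printf '%s\n' '20 17 ACGT' '16 15 TT' '11 10 GG'; }

# Step 2: append the gap between this start and the previous stop.
step2() { awk '{print $1, $2, $3, (NR>1 ? ($1-p+1)*-1 : 0); p=$2}'; }

# Steps 3 and 4: left-pad each sequence by its gap, then turn spaces into Ns.
step34() { awk '{printf "%*s\n", length($3)+$4, $3}' | awk '{gsub(" ", "N")}1'; }

# Steps 5 and 6: join the lines and add a header.
# Prints ">seq1" followed by "ACGTTTNNNGG"
toy | step2 | step34 | tr -d "\n" | awk 'NR==1 {print ">seq1"}1'
```

The first two toy lines are adjacent (17 is followed by 16), so nothing is inserted between them. The third line starts at 11 while the previous one stopped at 15, so positions 14, 13 and 12 are missing and three Ns are added.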