There were two requests to manipulate a fasta with awk.
- Break a sequnece in two halves. Print second half before first half
- Extract middle 150bp from each sequence
input fasta sequence:
$ cat test.fa
>read1
CAGGTCTGGCTGGATGAAGGGCACGGCATAGGTCTGACCTGCCAGGGAGTGCTGCATCCTCACAGGAGTCATGGTGCTGCTGAAGATGTCTCCAGAGACCTTCTGCAGGTACTGCAGGGCATCCGCCATCTGCTGGACGGCCTCCTCTCGCCG
>read2
CATCTGCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTGGGCCTGGAGAGGCTGCCGGGAGGTAGGAGTGGTGAGGTCGACTTGAGAAAGTTCAGGGCCTGGAGAGAAGGCTGGGAGGCAGGAGCTGGGTCTAAAGAGGCCATTGTAACGATGGAGCTGTGCCTGTGGAGGCTGTTGTGAGGCAGTAGCCT
>read3
TTGAGGTGGGAGGATCGCTTCAGCCTGGAAGGTTGAGGCTGCAGTCAGCTGCGATAGCACTACTACACTCCAGCCTTGGACAACAGAGGGAGACCTTTCGCTGTCACCCCTCTAGAATCCACGTATACGAAAATTCCAAATGTTAGTTGGGCATAGTGGCAAGCACCTGTAGTCTCAGCCACGTGGGAGG
Now let’s break this into two halves and print the second half before first half. Before any parsing, we need to flatten the fasta sequence i.e convert the multiline fasta sequence into single line sequence for each entry. We can do it with awk or with specialized tools such as seqkit.
$ seqkit seq -w 0 test.fa | awk -v OFS="\n" '/^>/ {getline seq}{L=length(seq); L2=int(L/2); print $0, substr(seq,L2+1) substr(seq,1,L2) }'
>read1
CTGCTGAAGATGTCTCCAGAGACCTTCTGCAGGTACTGCAGGGCATCCGCCATCTGCTGGACGGCCTCCTCTCGCCGCAGGTCTGGCTGGATGAAGGGCACGGCATAGGTCTGACCTGCCAGGGAGTGCTGCATCCTCACAGGAGTCATGGTG
>read2
GTGGTGAGGTCGACTTGAGAAAGTTCAGGGCCTGGAGAGAAGGCTGGGAGGCAGGAGCTGGGTCTAAAGAGGCCATTGTAACGATGGAGCTGTGCCTGTGGAGGCTGTTGTGAGGCAGTAGCCTCATCTGCGGAGGCTGCCGTGACGTAGGGTATGGGCCTAAATAGGCCATTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTGGGCCTGGAGAGGCTGCCGGGAGGTAGGA
>read3
TTTCGCTGTCACCCCTCTAGAATCCACGTATACGAAAATTCCAAATGTTAGTTGGGCATAGTGGCAAGCACCTGTAGTCTCAGCCACGTGGGAGGTTGAGGTGGGAGGATCGCTTCAGCCTGGAAGGTTGAGGCTGCAGTCAGCTGCGATAGCACTACTACACTCCAGCCTTGGACAACAGAGGGAGACC
Now let’s extract only the middle 150 bp part from each sequence
$ awk -v OFS="\n" '/^>/ {getline seq}{print $0,substr(seq,length(seq)/2-75,150)}' test.fa
>read1
CAGGTCTGGCTGGATGAAGGGCACGGCATAGGTCTGACCTGCCAGGGAGTGCTGCATCCTCACAGGAGTCATGGTGCTGCTGAAGATGTCTCCAGAGACCTTCTGCAGGTACTGCAGGGCATCCGCCATCTGCTGGACGGCCTCCTCTCG
>read2
TTGTGAGTCATGAGCTTGGTCTGTAGAGGCTGACTGGAGAAAGTTCTGGGCCTGGAGAGGCTGCCGGGAGGTAGGAGTGGTGAGGTCGACTTGAGAAAGTTCAGGGCCTGGAGAGAAGGCTGGGAGGCAGGAGCTGGGTCTAAAGAGGCC
>read3
TCAGCCTGGAAGGTTGAGGCTGCAGTCAGCTGCGATAGCACTACTACACTCCAGCCTTGGACAACAGAGGGAGACCTTTCGCTGTCACCCCTCTAGAATCCACGTATACGAAAATTCCAAATGTTAGTTGGGCATAGTGGCAAGCACCTG