I came across a request where a user wants to trim a multi-sequence FASTA alignment to the region covered by the very first sequence, which is padded with gaps. Here is the example:


$ cat test.fa

>seq1
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-------------------------------cttgagctggggtctggccatggggtaaa
gaagcagcagcagagacagaccaatgccaatg----------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
------------------------------------------------------------
-----------
>seq2
ggatcagcaccagccccggctgctccttcaccttgagttgtggccgggccatggggtaga
gcaggagcaacaaggacagcccgataccgatcaggatcccatactcaatgcccacacata
gtgaacccaaaaaggtggccacatgcacaaacagatcccatttattggtgcgccacaaaa
cggggatgattttgtagtcgaccatctgcatgacggccatgatgatgaccgcggccagcg
ctgacttggggatgtagtaacagtagggcaccaggaaggccagtactaacaggatcaggg
accctgtgaaaagaccattcgccggtgttcttacaccgctctgtgagttgacagcagttc
tggaaaaactgccggtgacaggataggaatgaacaaaggaactgagaatgttggcagtac

As you can see, the user has an aligned file. The very first sequence is the sequence of interest, but it has leading and trailing gaps. The user wants to extract the first sequence without those gaps, along with the residues at the same alignment coordinates from the rest of the sequences in the file. Here is the solution:

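First, let us get the coordinates of the ungapped stretch in the first sequence. Here -P searches only the positive strand, -r treats the pattern as a regular expression, -i ignores case, and -p gives the pattern:


$ seqkit head -n 1 test.fa | seqkit locate -Prip '([ATGC]+)'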

seqID	patternName	pattern	strand	start	end	matched
seq1	([ATGC]+)	([ATGC]+)	+	212	272	cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatg
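
As a sanity check on these coordinates, the span can also be computed without seqkit. The following is a minimal awk sketch, not part of the original solution: the function name `first_seq_span` is made up for illustration, and it assumes that `-` is the only gap character and that the first record is the reference. It joins the wrapped lines of the first record and reports the 1-based positions of its first and last non-gap characters.

```shell
# Hypothetical helper: print "start:end" (1-based) of the non-gap span
# of the first record in a FASTA alignment read from stdin.
# Assumes "-" is the only gap character.
first_seq_span() {
  awk '
    /^>/ { if (++n > 1) exit; next }  # skip first header, stop at second
    { seq = seq $0 }                  # join the wrapped sequence lines
    END {
      # The leftmost-longest match runs from the first to the last non-gap
      # character; the second match() covers a single-residue sequence.
      if (match(seq, /[^-].*[^-]/) || match(seq, /[^-]/))
        print RSTART ":" (RSTART + RLENGTH - 1)
    }
  '
}

# Usage: first_seq_span < test.fa
```

For the file above this should report the same 212:272 span. Unlike the `[ATGC]+` pattern, it treats any gaps inside the first sequence as part of the span.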

Now let us take the coordinates 212 and 272 and extract the region between those positions from all the sequences (-w 0 turns off line wrapping in the output):

$ seqkit -w 0 subseq -r 212:272 test.fa

>seq1
cttgagctggggtctggccatggggtaaagaagcagcagcagagacagaccaatgccaatg
>seq2
gacggccatgatgatgaccgcggccagcgctgacttggggatgtagtaacagtagggcacc
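
The two steps can also be chained so that the coordinates do not have to be copied by hand. The sketch below is one possible way to do it, not part of the original answer: `locate_to_region` is a hypothetical helper name, and it assumes the tab-separated `seqkit locate` output shown earlier, with a header line and `start` and `end` in columns 5 and 6. If the first sequence contained internal gaps, `[ATGC]+` would match each ungapped run separately, so the helper takes the smallest start and the largest end over all rows.

```shell
# Hypothetical helper: turn `seqkit locate` output (TSV with a header line,
# start in column 5, end in column 6) into a "start:end" region string.
locate_to_region() {
  awk -F '\t' '
    NR == 1 { next }                             # skip the header line
    min == "" || $5 + 0 < min { min = $5 + 0 }   # smallest start seen so far
    $6 + 0 > max              { max = $6 + 0 }   # largest end seen so far
    END { if (min != "") print min ":" max }     # print nothing if no match
  '
}

# Usage, assuming seqkit is installed and test.fa is the file shown above:
#   region=$(seqkit head -n 1 test.fa | seqkit locate -Prip '([ATGC]+)' | locate_to_region)
#   seqkit -w 0 subseq -r "$region" test.fa
```

The helper prints nothing when there is no match, so it is worth checking that `$region` is non-empty before passing it to `seqkit subseq`.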

An interesting request, I would say.