There are times, we want to replace sequence (in fasta format) headers using parts of the headers. Here is an example queestion and it’s solution using seqkit.

Example input:

>seq1 protein1
MEFKH
>seq2 protein 2
MKESR
>seq3 protein 3
MNNQR

User has another file with multicolumn mapping as below:

$ cat test.tsv
seq1	group_1000	ID_01
seq2	group_1001	ID_02
seq3	group_1002	ID_03      

Now user wants to append 3rd column (starting with ID) at the after sequence ID (seq1,2,3), but retain rest of the text too. Expected output is:

>seq1 ID_01  protein1
MEFKH
>seq2 ID_02  protein 2
MKESR
>seq3 ID_03  protein 3
MNNQR  

Code for above problem is:

$ seqkit -w 0 --quiet replace -p '^(\w+)( .+)$' -r '${1} {kv} ${2}' -k <(awk -v OFS="\t" '{print $1,$3}'  test.tsv) test.fa