Fasta headers are sometimes too long, and one would like to keep only a certain part of the header. This is easy when that part sits between two words (strings), but what if it sits between two special characters such as | and ,?
Let us do it in sed, awk and R. Example fasta:
$ cat test.fa
>gi|3834|ref|NC_2356.3|test1, name1
ATCGT
>gi|8356|ref|NC_713.3|test2, name2
GTCTGG
What we want is to rename the fasta headers to:
>test1
ATCGT
>test2
GTCTGG
Sed Code:
$ sed -r '/>/ s/.*\|(.*),.*/>\1/' test.fa
What is happening here:
-r -- makes sed use extended regular expressions (works on FreeBSD and Linux; -E is the equivalent, more portable spelling)
/>/ -- address pattern >. The substitution is applied only to lines containing >
s -- substitute (replace)
.*\|(.*),.* -- a regular expression that matches everything up to the last | (.* is greedy), then the characters between that | and the last comma, then everything after the comma. The characters between | and , are enclosed in parentheses, so they form capture group 1.
>\1 -- the replacement: a literal > followed by capture group 1 (the characters between | and ,). It replaces everything matched by the regular expression above (.*\|(.*),.*)
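The greedy matching described above can be tried on a header with more than one comma. The following is a minimal sketch, not the only way to write it: it recreates the example file, adds one hypothetical record (test3) that is not in the original example, and uses [^|,]* instead of .* so that the capture stops at the first comma. The output file name test_renamed.fa is an assumption made for illustration.

```shell
# Recreate the example file from the post, plus one hypothetical record
# (test3) whose header contains two commas.
cat > test.fa <<'EOF'
>gi|3834|ref|NC_2356.3|test1, name1
ATCGT
>gi|8356|ref|NC_713.3|test2, name2
GTCTGG
>gi|1111|ref|NC_1.1|test3, name3, extra
AAAA
EOF

# -E is the portable spelling of -r (extended regular expressions).
# ^>.*\| consumes everything up to the last '|'.
# [^|,]* cannot cross a '|' or ',', so group 1 ends at the first comma.
sed -E '/^>/ s/^>.*\|([^|,]*),.*/>\1/' test.fa > test_renamed.fa

cat test_renamed.fa
```

With the original expression, (.*) is greedy and would keep "test3, name3" for the third record; the bracket expression keeps only test3.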
Awk code:
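$ awk -F '[/^>|,]' 'NF>1{print ">"$6} {print $1}' test.fa | awk NF
What is happening here:
- -F sets the field separator; a bracket expression allows several delimiters at once
- delimiters: [/^>|,] splits on any one of the characters /, ^, >, | and , (inside the brackets these are literal characters, so ^ is not a start-of-line anchor here; only >, | and , occur in this file)
- because a header line starts with >, its first field is empty and the name we want (test1) is field 6
- NF>1 - select lines with more than one field, i.e. the header lines (sequence lines contain no delimiter)
- print ">"$6 - print field 6 with > prepended
- print $1 - print field 1 of every line: the whole sequence on sequence lines, an empty string on header lines
- awk NF - remove the blank lines produced by the previous step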
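To see how the header is split into fields, and as one possible simplification, here is a small sketch. It assumes the same example file; the output name test_awk.fa is hypothetical. The first command only prints the numbered fields of one header, and the second uses | and , as the only delimiters, so the wanted name becomes field 5 and no second awk is needed to drop blank lines.

```shell
# Recreate the example file from the post.
cat > test.fa <<'EOF'
>gi|3834|ref|NC_2356.3|test1, name1
ATCGT
>gi|8356|ref|NC_713.3|test2, name2
GTCTGG
EOF

# Show the fields produced by the delimiter set used in the post.
# Field 1 is empty because the line starts with the delimiter '>'.
head -n 1 test.fa |
  awk -F '[/^>|,]' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'

# Alternative sketch: split on '|' and ',' only.
# Header lines: print '>' plus field 5, then skip to the next line.
# All other lines (the sequences): print unchanged.
awk -F '[|,]' '/^>/ { print ">" $5; next } { print }' test.fa > test_awk.fa

cat test_awk.fa
```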
R code:
$ library(Biostrings)
$ library(stringr)
$ fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
$ names(fasta)=str_split_fixed(str_split_fixed(names(fasta),"\\|",5)[,5],",",2)[,1]
$ writeXStringSet(fasta, filepath = 'test_edited.fa',format="fasta")
What is happening here:
- load libraries (Biostrings and stringr)
- read the fasta file from disk
- names(fasta) holds the fasta headers of the original file (without the leading >)
- str_split_fixed is used twice. The first call splits each header on | into 5 pieces and keeps the 5th ("test1, name1"). The second call splits that piece on , into 2 pieces and keeps the first ("test1").
- once the headers are changed, write the sequences back to disk in fasta format
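The two-step split is easy to check on a single header from the command line. The sketch below is not part of the R workflow; it mirrors the same logic with cut, and the variable names header, fifth and name are chosen only for illustration.

```shell
# One header from the example file.
header='>gi|3834|ref|NC_2356.3|test1, name1'

# Step 1: split on '|' and keep the 5th piece ("test1, name1").
fifth=$(printf '%s\n' "$header" | cut -d '|' -f 5)

# Step 2: split that piece on ',' and keep the 1st part ("test1").
name=$(printf '%s\n' "$fifth" | cut -d ',' -f 1)

printf '%s\n' "$name"
```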