Sometimes simple things are difficult for beginners, especially when the documentation concentrates on high-level functions and complicated examples rather than simple ones and simple workflows. Let us look at one such function in Biopython/Python. Our tasks are:
1) Extract all the sequences from a FASTA file and write each sequence into its own file, with the FASTA header (without ">") as the file name.
2) Extract only the headers (without ">") into a separate file. This gives us all the sequence names in the FASTA file.
Consider the following test file:
Example FASTA file (test.fa):
>seq1
ATGCGGCGA
>seq2
GACTACTA
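To follow along without creating the file by hand, the example records can be written from Python. This is a minimal sketch using only the standard library; the helper name make_test_file is hypothetical and not part of Biopython.

```python
from pathlib import Path

# The two example records from the text, exactly as they appear in test.fa.
FASTA_TEXT = ">seq1\nATGCGGCGA\n>seq2\nGACTACTA\n"


def make_test_file(path="test.fa"):
    """Write the example records to `path` and return it as a Path (hypothetical helper)."""
    target = Path(path)
    target.write_text(FASTA_TEXT)
    return target
```

Calling make_test_file() creates test.fa in the current directory, which is where the commands in this post expect to find it.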
1) Extract each sequence into a file:
Now we want to extract the seq1 and seq2 sequences and store them in two files named after seq1 and seq2. The code below works with Python 3 and Biopython 1.5 and above.
===========================================================
>>> from Bio import SeqIO
>>> for seq in SeqIO.parse("test.fa", "fasta"):
...     SeqIO.write(seq, seq.id + ".fa", "fasta")
...
============================================================
SeqIO.write takes three arguments: what to write (here seq, each sequence record), the file name to write to (seq.id + ".fa", where seq.id is the FASTA sequence ID from test.fa), and the format (here fasta).
Output:
=============================
$ ls seq*.fa
seq1.fa seq2.fa
----
$ cat seq1.fa
>seq1
ATGCGGCGA
$ cat seq2.fa
>seq2
GACTACTA
============================
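The IDs in test.fa are already valid file names, but real FASTA IDs often contain characters such as "|" or "/" that are awkward or invalid in file names. The sketch below shows one possible way to clean an ID before using it as a file name. The helper safe_filename and its replacement rule are assumptions made for illustration and are not part of Biopython.

```python
import re


def safe_filename(record_id, extension=".fa"):
    """Turn a FASTA ID into a file name (hypothetical helper).

    Every character other than letters, digits, dot, underscore and hyphen
    is replaced with an underscore, then the extension is appended.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", record_id)
    return cleaned + extension


print(safe_filename("seq1"))
print(safe_filename("sp|P12345|ABC_HUMAN"))
```

Inside the loop above, the file name argument seq.id + ".fa" could then be replaced by safe_filename(seq.id). Note that two different IDs can map to the same cleaned name, so this rule is only a starting point.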
2) Extract all sequence names from a FASTA file (i.e. all lines starting with ">", discarding the ">" character):
Use the FASTA file above; the output file contains seq1 and seq2.
=====================================================
>>> from Bio import SeqIO
>>> from datetime import datetime
>>> dnafile = open("ids_" + datetime.now().strftime("%Y%m%d_%H%M%S") + ".txt", "a")
>>> for record in SeqIO.parse("test.fa", "fasta"):
...     dnafile.write(record.id + "\n")
...
>>> dnafile.close()
====================================================
The file name above (the file opened as dnafile) incorporates a time stamp from the system clock. This is a simple precaution against overwriting an existing IDs file.
Output:
=====================================
$ cat ids_20171204_162253.txt    # 20171204_162253 = time stamp
seq1
seq2
======================================
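To see what the ID extraction amounts to, here is a rough plain-Python equivalent that does not use Biopython. It assumes, as Biopython's FASTA parser does for record.id, that the ID is the first whitespace-separated word after ">". The function name fasta_ids is hypothetical, and this is a sketch for illustration, not a replacement for SeqIO.parse.

```python
def fasta_ids(lines):
    """Yield the ID of every FASTA header found in an iterable of lines."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # A bare ">" line has no ID; yield an empty string for it.
            yield fields[0] if fields else ""


# Example input: the first header carries a description after the ID.
example = [">seq1 first test sequence\n", "ATGCGGCGA\n", ">seq2\n", "GACTACTA\n"]
ids = list(fasta_ids(example))
print(ids)
```

An open file object iterates over its lines, so fasta_ids(open("test.fa")) works the same way on the test file.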