For bioinformaticians, FASTA files are routine files to deal with. At times, a bioinformatician needs to do one of the following with a FASTA file (with a few sequences in it):
1) Extract partial sequence out of each sequence and store it in a separate file as different sequences
2) Extract partial sequence out of each sequence and store in a separate file as one single sequence.
Such a request may seem strange, but one never knows what a higher-up will ask for.
Let us work with an example fasta file with three sequences in it.
=======================
>gene1
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
>gene2
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
>gene3
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA
=========================
The example above contains 3 sequences saved in a file (let us call the file genes.fa). We have two requirements:
1) Extract the first one third of the nucleotides (bases) from each sequence and store the extracted sequences as separate sequences in a new file. For example, take the first 1/3 of the bases from gene1 and store it as a new sequence, then do the same for gene2 and gene3. The new file will still have 3 sequences, but each one will be shortened.
2) Extract the first one third of the nucleotides (bases) from each sequence and store the extracted sequences as a single sequence in a new file. Take the first 1/3 of the bases from gene1, gene2 and gene3, and combine all of them into one single sequence.
For this, let us use R and the Biostrings package. When a sequence length is not divisible by 3, the cut point is rounded up with ceiling(). The code is as follows:
Let us do the first part:
# Load the Biostrings library
$ library(Biostrings)
# Read the file and save it as a DNAStringSet object
$ genes=readDNAStringSet("genes.fa", format = "fasta")
# Extract the first 1/3 of each sequence in the DNAStringSet object
$ subfasta=subseq(genes,1,ceiling(width(genes)/3))
# Save the three newly created sequences to the file subgenes.fa
$ writeXStringSet(subfasta, "subgenes.fa")
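As a sanity check on the first part, here is a minimal base R sketch (no Biostrings needed) that works out where each sequence gets cut. It assumes the three example sequences are typed in as plain strings without spaces; the names seqs, cut and expected are introduced here only for illustration.

```r
# Example sequences copied by hand from genes.fa (assumed unwrapped, no spaces)
seqs <- c(
  gene1 = "ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC",
  gene2 = "CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG",
  gene3 = "ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA"
)

# Cut point for each sequence: one third of its length, rounded up
cut <- ceiling(nchar(seqs) / 3)

# First third of each sequence, using plain string slicing
expected <- substr(seqs, 1, cut)

print(nchar(seqs))  # 48, 50, 50
print(cut)          # 16, 17, 17
print(expected)
```

The sequences in subgenes.fa should have these same lengths (16, 17 and 17 bases) if the Biostrings code behaved as intended.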
Let us do the second part:
# Load the Biostrings library
$ library(Biostrings)
# Read the file and save it as a DNAStringSet object
$ genes=readDNAStringSet("genes.fa", format = "fasta")
# Extract the first 1/3 of each sequence in the DNAStringSet object
$ subfasta=subseq(genes,1,ceiling(width(genes)/3))
# Combine all the new sequences into one single sequence
$ csubfasta=DNAStringSet(unlist(subfasta))
# Give a name to the new sequence
$ names(csubfasta)="new_gene"
# Write this new sequence to a new file
$ writeXStringSet(DNAStringSet(csubfasta),"new_gene.fasta")
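To see what the combined sequence should look like, here is a small base R sketch that joins the three first-third pieces by hand. It assumes the example sequences are unwrapped (no spaces), and the names parts and combined are illustrative only, not part of Biostrings.

```r
# First third of gene1, gene2 and gene3 from the example (16, 17 and 17 bases)
parts <- c("ACATATTGGAGGCCGA", "CTAACCTCTCCCAGTGT", "ATTTCCTCCTGCTGCCC")

# Join the pieces end to end, in file order, into one sequence
combined <- paste(parts, collapse = "")

print(combined)
print(nchar(combined))  # 50
```

The single record in new_gene.fasta should carry the header new_gene and a 50-base sequence matching the one printed here.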