Biologist's bioinformatics notes

For bioinformaticians, FASTA files are routine files to deal with. At times, a bioinformatician needs to do one of the following with a FASTA file that contains a few sequences:

1) Extract part of each sequence and store the pieces in a separate file as separate sequences.
2) Extract part of each sequence and store the pieces in a separate file as one single sequence.

Such a request may seem strange, but one never knows what will come from someone higher up.

Let us work with an example fasta file with three sequences in it.

=======================
>gene1
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
>gene2
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
>gene3
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA
=========================

The example above contains 3 sequences saved in a file (let us call the file genes.fa). We have two requirements:

1) Extract the first one third of the nucleotides (bases) from each sequence and store the extracted pieces as separate sequences in a new file. For example, take the first 1/3 of the bases from gene1 and store it as a new sequence, then do the same for gene2 and gene3. The new file will have 3 sequences, each shortened.

2) Extract the first one third of the nucleotides (bases) from each sequence and store the extracted pieces as a single sequence in a new file. Take the first 1/3 of the bases from gene1, from gene2 and from gene3, and combine them into one single sequence.
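Before bringing in any package, the "first one third" rule can be checked by hand. The following is only a sketch in base R, with the three example sequences typed in as plain strings; the variable names (seqs, keep, first_third, combined) are made up for illustration. When a length is not divisible by 3, the count is rounded up with ceiling(), so gene1 (48 bases) keeps 16 bases while gene2 and gene3 (50 bases each) keep 17.

```r
# Example sequences copied from genes.fa (typed in by hand for this sketch)
seqs <- c(
  gene1 = "ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC",
  gene2 = "CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG",
  gene3 = "ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA"
)

# Number of bases to keep: one third of each length, rounded up
keep <- ceiling(nchar(seqs) / 3)

# Requirement 1: one shortened sequence per gene
first_third <- substr(seqs, 1, keep)

# Requirement 2: the shortened sequences joined into one single sequence
combined <- paste(first_third, collapse = "")

print(keep)
print(first_third)
print(nchar(combined))
```

The combined sequence is 16 + 17 + 17 = 50 bases long.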

For this, let us use R and the Biostrings package (from Bioconductor). The code is as follows:

Let us do the first part:

# Load the Biostrings library

$ library(Biostrings)

# Read the file and save it as a DNAStringSet object

$ genes=readDNAStringSet("genes.fa", format = "fasta")

# Extract the first 1/3 of each sequence in the DNAStringSet object (length rounded up)

$ subfasta=subseq(genes,1,ceiling(width(genes)/3))

# Save the three newly created sequences to the file subgenes.fa

$ writeXStringSet(subfasta, "subgenes.fa")

Let us do the second part:

# Load the Biostrings library

$ library(Biostrings)

# Read the file and save it as a DNAStringSet object

$ genes=readDNAStringSet("genes.fa", format = "fasta")

# Extract the first 1/3 of each sequence in the DNAStringSet object (length rounded up)

$ subfasta=subseq(genes,1,ceiling(width(genes)/3))

# Combine all the new sequences into one single sequence
$ csubfasta=DNAStringSet(unlist(subfasta))

# Give a name to the new sequence
$ names(csubfasta)="new_gene"

# Write this new sequence to a new file
$ writeXStringSet(csubfasta,"new_gene.fasta")
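If Biostrings is not installed, both parts can be roughly reproduced in base R. This is a swapped-in approach, not the Biostrings method used above: read_fasta and write_fasta are hypothetical helpers written for this sketch, and they assume every header line starting with ">" is followed by at least one sequence line. The input is written to a temporary file so the example runs on its own.

```r
# Hypothetical helper: read a FASTA file into a named character vector
read_fasta <- function(path) {
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]            # drop blank lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)              # record number for every line
  seqs <- tapply(lines[!is_header], record[!is_header], paste, collapse = "")
  setNames(as.vector(seqs), sub("^>", "", lines[is_header]))
}

# Hypothetical helper: write a named character vector as FASTA, one line per sequence
write_fasta <- function(seqs, path) {
  writeLines(as.vector(rbind(paste0(">", names(seqs)), seqs)), path)
}

# Recreate the example genes.fa in a temporary location
in_path <- tempfile(fileext = ".fa")
writeLines(c(
  ">gene1", "ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC",
  ">gene2", "CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG",
  ">gene3", "ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA"
), in_path)

genes <- read_fasta(in_path)

# First 1/3 of each sequence, length rounded up
subfasta <- substr(genes, 1, ceiling(nchar(genes) / 3))

# Part 1: three shortened sequences in one file
sub_path <- tempfile(fileext = ".fa")
write_fasta(subfasta, sub_path)

# Part 2: the shortened sequences joined into one sequence named new_gene
new_path <- tempfile(fileext = ".fasta")
write_fasta(c(new_gene = paste(subfasta, collapse = "")), new_path)

print(readLines(sub_path))
print(readLines(new_path))
```

For real work the Biostrings version is the better choice, since it validates the DNA alphabet and wraps long sequences when writing.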

Oct 4, 2016 - Extract sequences from fasta file