Most biologists know that a codon represents an amino acid and is composed of 3 contiguous bases. Of the three bases, third base shows limited variability. Because of this an amino acid can be encoded by more than one codon. One of the requests I got was to convert every third base to corresponding purine (R) and pyrimidine (Y) symbol. As per IUPAC nucleotide code, bases A and G are purines and bases C and T are pyrimidines. Let us do this in python without any external library. I am using python 3 for this task. First let us take a simple fasta with arbitrary number of sequences and convert that into a sequence file with R and Y at every third base.
input:
========================================
$ cat sequence.fa
> gene1
ACATATTGGAGGCCGAAACAATGAGGCGTGATCAACTCAGTATATCAC
> gene2
CTAACCTCTCCCAGTGTGGAACCTCTATCTCATGAGAAAGCTGGGATGAG
> gene3
ATTTCCTCCTGCTGCCCGGGAGGTAACACCCTGGACCCCTGGAGTCTGCA
========================================
output:
========================================
$ cat output.fa
>gene1
ACRTAYTGRAGRCCRAARCARTGRGGYGTRATYAAYTCRGTRTAYCAY
>gene2
CTRACYTCYCCYAGYGTRGARCCYCTRTCYCAYGARAARGCYGGRATRAG
>gene3
ATYTCYTCYTGYTGYCCRGGRGGYAAYACYCTRGAYCCYTGRAGYCTRCA
============================================
code (note that input and output files are hard coded. change it as per requirement):
======================================
> d = {}
> with open("sequence.fa", 'r') as f:
for i in f:
d[i.strip("\n> ")] = next(f).rstrip()
> ntdict = {"A": "R", "T": "Y", "G": "R", "C": "Y"}
> with open("output.fa", "w") as f:
for key, value in d.items():
temp_key = "".join(ntdict.get(w.upper(), w) if not i % 3 else w for i, w in enumerate(list(d[key]), 1))
print("{0}{1}\n{2}".format(">", key, temp_key.upper()), file=f)
============================================
Ps: Take care of indents.