Biologist's bioinformatics notes

Sequence information for genes and proteins is stored in fasta format. Fasta files are plain text files with minimal markup: a header line starting with ">" followed by the sequence. Many manipulations can be done on these files; one of them is printing randomly chosen sequences. Say there are 10 species and each species has 4 sequences, and OP wants to print one random sequence for each species. In the following example there are 3 species, each with a number of sequences (.seq0 to .seq3; SpeciesC has only two). OP wants each species with one random sequence for that species, in fasta format.

$ cat test.fa
>SpeciesA.seq0
CCACTTTA
>SpeciesA.seq1
CCTCTTTA
>SpeciesA.seq2
CCGCTTTA
>SpeciesA.seq3
CCACTTTA
>SpeciesB.seq0
GCCCTTTA
>SpeciesB.seq1
GCCCTTTA
>SpeciesB.seq2
ACCCTTTA
>SpeciesB.seq3
GCCCTTTA
>SpeciesC.seq0
GCCCTTTA
>SpeciesC.seq1
GCCCTTTA

For this we need the seqkit tool. Run the commands in a bash shell. Note that the grep -A1 step assumes each sequence sits on a single line.

=============================================================

$ for i in $(seqkit seq -n test.fa | cut -f1 -d"." | uniq);
    do seqkit grep -rp "^$i\." test.fa | seqkit seq -n | shuf -n1 | grep -A1 -f - test.fa;
done

============================================================

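Another way of doing the same, using awk and datamash. The ID and the sequence are joined into a single field before sampling, because datamash rand draws independently for each field it is given and would otherwise pair an ID with a sequence from a different record:

===========

$ awk 'BEGIN{RS=">"} NR>1{sp=$1; sub(/\..*/,"",sp); print sp"\t"$1"|"$2}' test.fa | datamash -sg1 rand 2 | cut -f2 | awk -F'|' '{print ">"$1"\n"$2}'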

==========
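Using datamash and seqkit (again sampling the ID and sequence together as one field):

===========================

$ seqkit fx2tab test.fa | awk -F'\t' '{sp=$1; sub(/\..*/,"",sp); print sp"\t"$1"|"$2}' | datamash -sg1 rand 2 | cut -f2 | tr '|' '\t' | seqkit tab2fx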

==============
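
If seqkit and datamash are not available, the same idea can be sketched with awk alone. This is only an illustration, not part of OP's solution: the function name pick_random_per_species and the optional seed argument are made up here, and it assumes headers of the form ">Species.seqN" as in test.fa. It reads the file once and uses reservoir sampling, keeping the k-th record of a species with probability 1/k, so every record of a species is equally likely to be printed.

```shell
# Hypothetical helper (not from the original post): print one random
# record per species from a fasta file.
# Assumption: headers look like ">Species.seqN"; the species name is the
# text between ">" and the first dot.
# Usage: pick_random_per_species FILE [SEED]
pick_random_per_species() {
    awk -v seed="${2:-$RANDOM}" '
        BEGIN { srand(seed) }
        /^>/ {
            species = substr($0, 2)
            sub(/[.].*/, "", species)
            # Remember species in order of first appearance.
            if (!(species in count)) order[++n_species] = species
            count[species]++
            # Reservoir sampling: keep the k-th record with probability 1/k.
            keep = (rand() * count[species] < 1)
            if (keep) { chosen_header[species] = $0; chosen_seq[species] = "" }
            next
        }
        # Sequence lines of a kept record are joined onto one line.
        keep { chosen_seq[species] = chosen_seq[species] $0 }
        END {
            for (i = 1; i <= n_species; i++) {
                s = order[i]
                print chosen_header[s]
                print chosen_seq[s]
            }
        }
    ' "$1"
}
```

Calling pick_random_per_species test.fa prints one record per species; passing a second argument (for example 42) fixes the random seed so a run can be repeated. Without a seed it falls back to bash's $RANDOM.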

output, 1st time:

>SpeciesA.seq1
CCTCTTTA
>SpeciesB.seq1
GCCCTTTA
>SpeciesC.seq1
GCCCTTTA

output, 2nd time:

>SpeciesA.seq3
CCACTTTA
>SpeciesB.seq1
GCCCTTTA
>SpeciesC.seq1
GCCCTTTA
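
Since each printed header must be followed by the sequence that belongs to it in test.fa, a quick consistency check on the output can be useful. The following is a minimal sketch under assumptions: the function name check_picked_records is invented for illustration, and the file name picked.fa in the usage note is a placeholder for wherever the output was saved. It compares every header/sequence pair in the picked file against the source file and returns a non-zero status if any pair does not match.

```shell
# Hypothetical check (not from the original post): verify that every
# header/sequence pair in the picked file also occurs in the source file.
# Usage: check_picked_records SOURCE.fa PICKED.fa
check_picked_records() {
    awk '
        /^>/ { header = $0; next }
        # First file (source): collect the sequence for each header.
        NR == FNR { known[header] = known[header] $0; next }
        # Second file (picked): collect the sequence for each header.
        { picked[header] = picked[header] $0 }
        END {
            bad = 0
            for (h in picked)
                if (!(h in known) || known[h] != picked[h]) {
                    print "mismatch: " h
                    bad = 1
                }
            exit bad
        }
    ' "$1" "$2"
}
```

For example, after saving the output of one of the commands above to picked.fa, check_picked_records test.fa picked.fa prints nothing and returns 0 when all records are consistent.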

Apr 11, 2018 - Random fasta sequence generation