Sequence information for genes and proteins is stored in FASTA
format. FASTA files are plain-text files with minimal markup: a header
line starting with ">" followed by the sequence. Many data
manipulations can be done on these files; one of them is printing a
random selection of sequences. Suppose there are 10 species and each
species has 4 sequences, and the OP wants to print one random sequence
for each species. In the following example there are 3 species, each
with a certain number of sequences (.seq0 to .seq3 in the example
file). The OP wants one random sequence per species, in FASTA format.
=========== input file:
$ cat test.fa
>SpeciesA.seq0
CCACTTTA
>SpeciesA.seq1
CCTCTTTA
>SpeciesA.seq2
CCGCTTTA
>SpeciesA.seq3
CCACTTTA
>SpeciesB.seq0
GCCCTTTA
>SpeciesB.seq1
GCCCTTTA
>SpeciesB.seq2
ACCCTTTA
>SpeciesB.seq3
GCCCTTTA
>SpeciesC.seq0
GCCCTTTA
>SpeciesC.seq1
GCCCTTTA
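Before sampling, it can help to check how many sequences each species has in a file like the one above. The following is a minimal sketch using only standard tools; it assumes headers of the form >Species.seqN, and the function name count_per_species is made up for illustration.

```shell
# Count sequences per species in a FASTA file.
# Assumption: headers look like ">Species.seqN" (species name before the first dot).
count_per_species() {
    grep '^>' "$1" |        # keep header lines only
        cut -c2- |          # drop the leading ">"
        cut -d. -f1 |       # keep the species name
        sort | uniq -c |    # count occurrences of each species
        awk '{print $2 "\t" $1}'   # print as: species<TAB>count
}
```

Usage:
$ count_per_species test.fa
For the file above this should list SpeciesA and SpeciesB with 4 sequences each and SpeciesC with 2.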
========== Using seqkit (run in a bash shell):
===========================
$ for i in $(seqkit seq -n test.fa | cut -f1 -d"." | uniq);
do seqkit grep -rp $i test.fa | seqkit seq -n | shuf -n1 | grep -A1 -f - test.fa;
done;
========== Another way of doing the same, using awk and datamash:
===========================
$ awk 'BEGIN{RS=">"}{print $1"\t"$2}' test.fa |sed -e '1d;s/\./\t/g'| datamash -sg1 rand 2,3 | sed 's/\t/\./' | awk 'BEGIN{RS="\n"}{print ">"$1"\n"$2}'
========== Using datamash and seqkit:
===========================
$ seqkit fx2tab test.fa | sed 's/\./\t/' | datamash -sg1 rand 2,3 | sed 's/\t/\./' | seqkit tab2fx
============== output, 1st time:
>SpeciesA.seq1
CCTCTTTA
>SpeciesB.seq1
GCCCTTTA
>SpeciesC.seq1
GCCCTTTA
============== output, 2nd time:
>SpeciesA.seq3
CCGCTTTA
>SpeciesB.seq1
GCCCTTTA
>SpeciesC.seq1
GCCCTTTA
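
In the second output above, SpeciesA.seq3 is printed with CCGCTTTA, which is the sequence of SpeciesA.seq2 in test.fa. This suggests that datamash "rand 2,3" draws each column independently, so a header and its sequence may not stay together. Below is a minimal sketch, not a definitive implementation, that keeps each header with its own sequence using only awk. It assumes headers of the form >Species.seqN; the function name pick_one_per_species and its optional seed argument are made up for illustration.

```shell
# Print one random record per species, keeping header and sequence together.
# Assumption: headers look like ">Species.seqN" (species name before the first dot).
# Usage: pick_one_per_species FILE [SEED]
pick_one_per_species() {
    awk -v seed="${2:-}" '
        # Without a seed, srand() uses the clock, so two runs within
        # the same second can give the same picks.
        BEGIN { if (seed == "") srand(); else srand(seed) }
        /^>/ {
            sp = substr($0, 2)
            sub(/\..*/, "", sp)              # species = text before first dot
            if (!(sp in seen)) { seen[sp] = 0; order[++n] = sp }
            seen[sp]++
            # Reservoir sampling: the k-th record of a species replaces
            # the current pick with probability 1/k.
            keep = (rand() * seen[sp] < 1)
            if (keep) { header[sp] = $0; seq[sp] = "" }
            next
        }
        keep { seq[sp] = seq[sp] $0 }        # sequence lines of the kept record
        END {
            # Print species in order of first appearance.
            for (i = 1; i <= n; i++)
                print header[order[i]] "\n" seq[order[i]]
        }
    ' "$1"
}
```

Usage:
$ pick_one_per_species test.fa
Because the whole record is chosen in one draw, every printed header is followed by the sequence it has in test.fa. Sequences wrapped over several lines are joined into one line in the output.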