Biologist's bioinformatics notes

Biologists regularly deal with sequences, and these come in different formats and arrangements depending on the maintainer. Two of the requirements I have come across are as follows:

Separate all the fasta sequences as per organism, irrespective of strain and contig assembly
Rename the organism as per the file from NCBI

Let me give you a few examples of the file headers to show what the issue is:
==========================================================================

>NZ_SADZ01000474.1 Ochrobactrum sp. AV Scaffold_480, whole genome shotgun sequence
>NZ_PHFU01000092.1 Xylella fastidiosa subsp. fastidiosa strain CFBP8351 Xf_LSV462693, whole genome shotgun sequence
>NZ_SACX01000165.1 Escherichia coli strain JEONG-9595 NODE_13_length_103668_cov_11.7541_ID_25, whole genome shotgun sequence
>NZ_SADZ01000789.1 Ochrobactrum sp. AV Scaffold_806, whole genome shotgun sequence
>NZ_SADZ01000790.1 Ochrobactrum sp. AV Scaffold_807, whole genome shotgun sequence
>NZ_SADZ01000791.1 Ochrobactrum sp. AV Scaffold_808, whole genome shotgun sequence

========================================================================

If you look at the headers, you will see that each one contains the organism name and that there is one entry per scaffold. Some of the sequences are in individual files, while others are combined in a single file.
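To make the header layout concrete, here is a minimal sketch that pulls the four-letter code out of a header with sed. The function name `four_letter_code` is my own and hypothetical, and the sketch assumes every header looks like the examples above (a prefix such as NZ, an underscore, four letters, then digits).

```shell
# Hypothetical helper: extract the four-letter code from a FASTA header.
# Assumes headers of the form ">NZ_SADZ01000474.1 description".
four_letter_code() {
    printf '%s\n' "$1" | sed -E 's/^>?[A-Z]+_([A-Za-z]+)[0-9]+.*/\1/'
}

# Two of the example headers; each call prints only the four-letter code.
four_letter_code '>NZ_SADZ01000474.1 Ochrobactrum sp. AV Scaffold_480, whole genome shotgun sequence'
four_letter_code '>NZ_PHFU01000092.1 Xylella fastidiosa subsp. fastidiosa strain CFBP8351 Xf_LSV462693, whole genome shotgun sequence'
```

All scaffolds of one assembly share this code, which is why it can serve as the key for grouping sequences.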

The NCBI data file used for changing the names is assembly_summary_refseq.txt. Matching entries are copied below:
=========================================================================

GCF_004011925.1    PRJNA224116 SAMN10688388    SADZ00000000.1 na 2500158 2500158 Ochrobactrum sp. AV strain=AV       latest Scaffold    Major   Full    1/11/2019   ASM401192v1 National Environmental Engineering Research Institute   GCA_004011925.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/011/925/GCF_004011925.1_ASM401192v1
GCF_004016405.1 PRJNA224116 SAMN07999362    PHFU00000000.1 na 644356 2371    Xylella fastidiosa subsp. fastidiosa    strain=CFBP8351     latest Contig Major   Full    1/14/2019   ASM401640v1 INRA    GCA_004016405.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/016/405/GCF_004016405.1_ASM401640v1
GCF_004022065.1 PRJNA224116 SAMN04160771    SACX00000000.1 na 562 562 Escherichia coli    strain=JEONG-9595       latest Contig Major   Full    1/14/2019   ASM402206v1 US Food and Drug Administration GCA_004022065.1 identical   ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/004/022/065/GCF_004022065.1_ASM402206v1

=================================================================
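As a rough illustration of how these rows are used, the sketch below keeps only columns 4 and 8 of a tab-separated row. The helper names `sample_row` and `make_mapping` are hypothetical, the sample is just the first eight fields of the Escherichia coli row, and I am assuming the fields are tab-separated and that comment lines start with "#".

```shell
# Hypothetical sample: the first eight tab-separated fields of one summary row.
sample_row() {
    printf 'GCF_004022065.1\tPRJNA224116\tSAMN04160771\tSACX00000000.1\tna\t562\t562\tEscherichia coli\n'
}

# Keep column 4 (accession with the four-letter code) and column 8 (name).
# Lines starting with "#" are treated as comments and skipped (assumption).
make_mapping() {
    awk -F '\t' -v OFS='\t' '!/^#/ { print $4, $8 }'
}

sample_row | make_mapping
```

The result is one line per assembly with the accession, a tab, and the name.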

Now the task is:

Create one fasta file for each organism and put all the assembly (scaffold) sequences within that file
Rename the fasta files with the organism names from the NCBI summary file.

Workflow would be:

Extract all the fasta sequences belonging to each organism using the four-letter code and name each fasta file after that code
Replace the four-letter code in the file names with the names from the NCBI summary file.

Here is the code:

===============================================================

seqkit seq -w 0 *.fa | seqkit split -i --id-regexp "^\w+_([A-Za-z]+)[0-9]+" -O split_fasta
cd split_fasta
cut -f4,8 <path>/assembly_summary_refseq.txt > mapping.txt
while read -r before after; do mv "${before%%[0-9]*\.[0-9]}.fasta" "$after.fasta"; done < mapping.txt

================================================================

Here is the explanation:

================================================================

Seqkit concatenates all the fasta files and splits the sequences by the four letters that follow the first three characters of each header (for example, SADZ from NZ_SADZ01000474.1). The output files are named test.id_fourlettersfromheader.fa (for example, test.id_PHFU.fa) and are stored in the `split_fasta` folder, which seqkit creates automatically. If you are in doubt, you can run seqkit split in `--dry-run` mode first.
Change into the newly created split_fasta directory.
cut prints only columns 4 and 8 (tab separated) from the NCBI file and stores the accession carrying the four-letter code and the matching name in the mapping.txt file.
The loop strips the digits from each accession and renames the matching file; the new names still contain restricted characters at this point. read splits on whitespace (spaces and tabs) by default, and the last variable receives the rest of the line, so `before` gets the accession and `after` gets the full name, spaces included.

==================================================================
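The two shell details in the rename loop can be tried in isolation. This is only a sketch: `strip_accession` is a hypothetical wrapper around the same parameter expansion the loop uses, and the mapping line is a made-up sample in the accession, tab, name layout.

```shell
# Hypothetical wrapper around the parameter expansion used in the loop:
# remove the longest suffix that starts with a digit and ends in ".digit".
strip_accession() {
    acc=$1
    printf '%s\n' "${acc%%[0-9]*\.[0-9]}"
}

strip_accession 'SADZ00000000.1'

# read puts the first word in "before" and the rest of the line in "after",
# so a name containing spaces stays in one variable.
printf 'SACX00000000.1\tEscherichia coli\n' | while read -r before after; do
    printf '%s -> %s\n' "$(strip_accession "$before")" "$after"
done
```

Printing the pairs like this before running mv is a cheap way to check the mapping.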

Output for the code would be (with above fasta files):

===============================================================

tree .                  
.
├── Escherichia coli strain=JEONG-9595.fasta
├── mapping.txt
├── Ochrobactrum sp. AV strain=AV.fasta
└── Xylella fastidiosa subsp. fastidiosa strain=CFBP8351.fasta

===================================================================

Now the file names contain characters (such as spaces and =) that are not easy to parse with downstream tools. So edit the file from step 3 to change these characters to appropriate delimiters (for example, underscores). Step 4 can be done with parallel as well (remove `--dry-run` to actually rename the files):

$ parallel --colsep '\t' --dry-run mv '{=1 s/[0-9]+\.[0-9]//=}'.fasta {2}.fasta :::: mapping.txt
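One possible way to edit the mapping file is sketched below with awk. The function name `sanitize_mapping` is hypothetical, and the choice to turn every run of spaces and "=" into a single underscore is my assumption; pick whatever delimiter your downstream tools prefer.

```shell
# Hypothetical helper: rewrite column 2 of a tab-separated mapping so that
# every run of spaces or "=" becomes one underscore; column 1 is untouched.
sanitize_mapping() {
    awk -F '\t' -v OFS='\t' '{ gsub(/[ =]+/, "_", $2); print }'
}

# Made-up sample line in the mapping.txt layout (accession, tab, name).
printf 'SACX00000000.1\tEscherichia coli strain=JEONG-9595\n' | sanitize_mapping
```

To apply it to the real file, something like `sanitize_mapping < mapping.txt > mapping_clean.txt` should do, and the rename step would then read mapping_clean.txt instead.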

Sep 3, 2020 - Reorganize and rename fasta