Biologist's bioinformatics notes

FastQ screen is a tool from Babraham Bionformatics group and is found here. FastQ screen allows user to screen for contaminations (artifacts) in fastq files. They may be from other organisms of your interest or from vector or primers.

FastQ screen program needs following information to work:

Source files (.fastq, .fq and gzipped files)
Indexed reference genome and genomes you want to look for (for eg. mouse, pig etc)
Indexed contaminant sequences (for eg primer sequences or vector sequences)
Aligners- currently it supports 3 aligners: bwa, bowtie and bowtie2.
Configuration file with reference genome index and any other indexes (primer, vector)

Since it requires aligner, reference and other fasta files must be indexed with same aligner that would be used in FastQ screen execution. For eg. if you are using bowtie2 aligner in executing FastQ screen, you need to index your files (.fa) with bowtie2.

Let us work with an example. Download FastQ screen from here. Download the raw files from here. Let us work with one file only as an example. The fastq files align well with Chr22 and download chr22.fa (hg38/GRCh38 is preferred) for indexing. When you index fasta files (with bowtie2) it would create index files with .bt2 extension.

Please keep source (.fa) and index files in individual folders as index files will be too many (5 index files for chr22). For eg. Keep reference chr22 fasta file in a separate folder and primer sequence (.fa) in a separate folder. Then index them.

Step 1: Index reference file. Here it is chr22.fa from hg38.
$ bowtie2-build -f hg38.chr22.fa hg38.chr22hg38.chr22.fa - Reference fasta

hg38.chr22 - Base name for index files

Step 2: Index primer sequences. Download Illumina primers sequences from here and Index them.
$ bowtie2-build -f gsaf_illumina_adapters.fasta illumina
gsaf_illumina_adapters.fasta - Illumina Adapter sequences

illumina - Base name for index files

When you download and gunzip FastQ screen source, you would be provided with an example conf. Try to reuse this configuration file. Configuration file used for this note is as below:

===================================================
$ cat fastq_screen.conf
############

## Threads #

############
THREADS 6
##############

## DATABASES #

##############
DATABASE Human /home/<user>/reference/hg38/hg38.chr22
DATABASE Adapters /home/<user>/Desktop

===================================================

User can add many databases to screen sequences against, as per the mentioned format:

DATABASE Human /home/<user>/reference/hg38/hg38.chr22

(Database, Organsim name, Location of index files and base name of index: in this case: hg38.chr22 -- refer to your bowtie2 index basename)

DATABASE Adapters /home/<user>/Desktop(Database, General name, Location of Adapters fasta file. Index files must be located in the same folder)

Note that for reference, you need to furnish location and basename of index and for primers, vectors you need to furnish location of fasta file it self.

Final command to execute FastQ screen is:

$ ./fastq_screen -conf fastq_screen.conf /home/<>/raw_data/hcc1395_normal_rep1_r1.fastq.gz Please note that for this note, I have used only human sequence (chr22) and primer sequence only. Output would be in 3 formats: html, png and txt.

It would be as follows:
png

Text:

$ cat hcc1395_normal_rep1_r1_screen.txt

#Fastq_screen version: 0.11.1    #Aligner: bowtie2    #Reads in subset: 100000
Genome    #Reads_processed    #Unmapped    %Unmapped    #One_hit_one_genome    %One_hit_one_genome    #Multiple_hits_one_genome    %Multiple_hits_one_genome    #One_hit_multiple_genomes    %One_hit_multiple_genomes    Multiple_hits_multiple_genomes    %Multiple_hits_multiple_genomes
Human    110652    1306    1.18    64380    58.18    44966    40.64    0    0.00    0    0.00

%Hit_no_genomes: 1.18

html (screenshot):

Recent Posts

Links

Sep 19, 2017 - FastQ screen and NGS data analysis