The generic format for storing read sequences after sequencing is FASTQ (.fastq). Several tools can parse FASTQ files; one of them is seqkit. FASTQ files store a lot of useful information, such as the sequencing machine, lane, flowcell, index sequence, quality filter flag, the sequence itself and its per-base quality. Hence, before starting an analysis, it is a good idea to take a look at the FASTQ file. Download the example RNA-seq files from here. A description of these files is provided on the same page. Please go through it to understand the source of the samples.
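To make the header fields mentioned above concrete, here is a minimal sketch that splits one read header into its parts with awk. The header string is made up for illustration, and the field layout assumed here is the common Illumina (CASAVA 1.8+) style; headers from other platforms, older pipelines or public archives may look different, so check a real header from your own files first.

```shell
# Made-up Illumina-style read header, used for illustration only.
header='@K00193:38:H3MYFBBXX:4:1101:10003:44458 1:N:0:CGATGT'

# Split on ':' and on the single space between the two halves of the header.
# Assumed layout: instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
parsed=$(printf '%s\n' "$header" | awk -F'[: ]' '{
    sub(/^@/, "", $1)            # drop the leading "@"
    print "instrument: " $1
    print "run:        " $2
    print "flowcell:   " $3
    print "lane:       " $4
    print "read:       " $8
    print "filtered:   " $9      # Y = read failed the quality filter, N = passed
    print "index:      " $11
}')
printf '%s\n' "$parsed"
```

On a real file, the first header can be fed to the same awk program instead of the made-up string, for example by piping the first line of the decompressed file into it.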


Print the first 4 lines (i.e. the first read) of a fastq.gz file
  • $ seqkit seq hcc1395_normal_rep1_r1.fastq.gz | head -4
Print stats of a fastq file
  • $ seqkit stats hcc1395_normal_rep1_r1.fastq.gz
Print the names (header lines) of all the reads in a fastq
  • $ seqkit seq hcc1395_normal_rep1_r1.fastq.gz -n
Count the number of reads
  • $ seqkit seq hcc1395_normal_rep1_r1.fastq.gz -n | wc -l 
Print the number of reads in each of the 12 files (works in the bash shell). If you do not want to use seqkit, you can use zgrep instead. The zgrep version assumes that all reads come from the same machine, so that every read header starts with the same instrument ID (@K00193 here).
  • $ for i in *.gz; do echo $i;  seqkit seq $i -n | wc -l; done | paste - - 
  • $ for i in *.gz; do echo $i;  zgrep -P "^\@K00193" $i  | wc -l; done | paste - - 
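If the same-machine assumption does not hold, another option is to count lines rather than headers. The sketch below assumes standard four-line FASTQ records (no wrapped sequence lines) and uses only gzip and awk. It builds a tiny made-up file in a temporary directory so that it runs on its own; the file name and reads are hypothetical. The second read's quality string deliberately starts with "@", which is why matching on "^@" alone is not a safe way to count reads.

```shell
# Build a tiny two-read gzipped FASTQ (made-up data) so the sketch is self-contained.
tmpdir=$(mktemp -d)
printf '%s\n' \
    '@read1' 'ACGTAC' '+' 'IIIIII' \
    '@read2' 'GGCATT' '+' '@IIIII' \
    | gzip > "$tmpdir/example.fastq.gz"

# One read = four lines, so the read count is the line count divided by 4.
for f in "$tmpdir"/*.gz; do
    n=$(gzip -dc "$f" | awk 'END { print NR / 4 }')
    printf '%s\t%s\n' "$(basename "$f")" "$n"
done
```

To use it on the example data, run the loop in the directory holding the downloaded files and replace "$tmpdir"/*.gz with *.gz.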
Base frequency in a single read (count of each nucleotide), here the first read:
  • $ seqkit seq hcc1395_normal_rep1_r1.fastq.gz -s | head -1 | fold -w1 | sort | uniq -c
Count frequency of bases in a fastq file:
  • $ seqkit seq hcc1395_normal_rep1_r1.fastq.gz -s | fold -w1 | sort | uniq -c
This prints the number of occurrences of each base across all reads in the file.
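The fold/sort/uniq pipeline writes one line per base before sorting, which can be slow on large files. As a possible alternative, the sketch below tallies bases in a single awk pass. It assumes four-line FASTQ records and runs on made-up reads so that it is self-contained; the variable names are illustrative only.

```shell
# Made-up four-line FASTQ records, for illustration only.
fastq=$(printf '%s\n' \
    '@read1' 'ACGTAC' '+' 'IIIIII' \
    '@read2' 'GGCATN' '+' 'IIIIII')

# The sequence is line 2 of every 4-line record (NR % 4 == 2).
# Count each character of those lines, then print one "base<TAB>count" row per base.
counts=$(printf '%s\n' "$fastq" | awk 'NR % 4 == 2 {
    n = length($0)
    for (i = 1; i <= n; i++) freq[substr($0, i, 1)]++
}
END { for (b in freq) print b "\t" freq[b] }' | sort)
printf '%s\n' "$counts"
```

For a real file, replace the printf that feeds awk with gzip -dc hcc1395_normal_rep1_r1.fastq.gz. Ambiguous calls such as N are counted as their own category.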