With NGS, sequence files (fasta format) have become large and parsing fasta file for sequences within, has become easier with faidx implemented in samtools. In this page, seqkit and kentutils are used most. They can be downloaded from here and here.

Following are the steps to extract a region of interest from huge fasta (with .fasta/.fa extension) file (in GNU-linux)

1)  Index fasta file with samtools

Command line: $ samtools  faidx <reference file>
Example: $ samtools  faidx ref.fa

2) To view the indexed files, list the files in the directory. A new file with .fai extension should be available along with original file

For eg. If supplied reference file (ref.fa in above example) has .fa as extension, user should see a new file with .fai extension (example file that is created new: ref.fa.fai)

3) Extract the sequence of your interest with following command:

Command line: $ samtools faidx <reference> <region of interest>
Example code: $ samtools faidx ref.fa chr17:7668401-7687549 (for extracting TP53 sequence from ref.fa. Example ref.fa is hg19)

Make sure that fasta headers (chromosome/contigs)  match while querying indexed fasta. If fasta file has chromosome name as chr17 (in the header), then use the same while extracting regions of interest. If fasta file has only chromosome number in header, change the query as per the chromosome header. The same would apply for

Example code:

1) $ samtools faidx ref.fa chr17:7668401-7687549 (If chromosome headers start with >chr17 heading)
2) $ samtools faidx ref.fa 17:7668401-7687549 (If chromosome headers start with >17 heading)

4) Count the sequences in fasta file: Let us assume that there are several fasta sequences in file and you would like to count the number of sequences within the file

Example fasta can be downloaded here. This fasta file has two sequences in it and is named as test2.fa

code: $ seqkit seq test2.fa -n | wc -l

this would print the number of fasta sequences in the file

5) Print the sequence names within a fasta file

code: $ seqkit seq test2.fa -n

This would print all the lines with >. This would equal to the number of sequences in  fasta file

6) Count the bases in fasta file

code: $ faCount test2.fa

This would print number of A,T,G,C,N and CpG islands in a given sequence. For above example it would be:

#seq                  len      A        C       G      T       N        cpg
GQ118297.1    496    183    48      89    176    0        7
GQ118277.1    697    175    154    180  177    11     47
total                  1193   358    202    269  353    11    54

faCount is part of UCSC kentutils


6) Average size of the sequence in a file with multiple fasta sequences

code: $ seqkit stats test2.fa

This would print statistics of the fasta file. For example above (test2.fa), output would be:

file          format  type            num_seqs  sum_len  min_len  avg_len      max_len
test2.fa  FASTA   DNA          2                 1,193      496           596.5             697

Same can be done with faSize from kentutils.

Code: $ faSize test2.fa

7) Search for a pattern in a fasta file

code: $ seqkit locate -idp TATATGA test2.fa

In the above example, i am searching for a small motif (TATATGA) in test2.fa.Code includes ignore case, degenerate bases and pattern options. Ignore case ignores the case, degenerate bases look for degenerate bases at any given position and pattern tells the program that this is the pattern (TATATGA) i am looking for.

8) Search for a patten in a fasta file and identify the gene in which the patten is located

code:
$ seqkit locate -i -p ccttctctgggccttgatttcccctcctgc chr12.fa --bed
| bedtools intersect -a - -b ../reference/chr12/genes_chr12.gtf -wb
 
This code is looking for a pattern in human chromosome 12 (fa file) in first part
and in second part this code is intersecting gtf file with genes and exons obtained from UCSC and the pattern is (ccttctctgggccttgatttcccctcctgc). Output is:

chr12 186551 186581 ccttctctgggccttgatttcccctcctgc 0 + chr12 unknown exon 186542 186878 . + . gene_id "IQSEC3"; gene_name "IQSEC3"; p_id "P13619"; transcript_id "NM_015232"; tss_id "TSS12565"; (in a single line)

9) Extract only first 10 bases from fasta
code: $ seqkit subseq -r 1:10 test2.fa