With NGS, sequence files (fasta format) have become large and parsing fasta file for sequences within, has become easier with faidx implemented in samtools. In this page, seqkit and kentutils are used most. They can be downloaded from here and here.
Following are the steps to extract a region of interest from huge fasta (with .fasta/.fa extension) file (in GNU-linux)
1) Index fasta file with samtools
Command line: $ samtools faidx <reference file>
Example: $ samtools faidx ref.fa
2) To view the indexed files, list the files in the directory. A new file with .fai extension should be available along with original file
For eg. If supplied reference file (ref.fa in above example) has .fa as extension, user should see a new file with .fai extension (example file that is created new: ref.fa.fai)
3) Extract the sequence of your interest with following command:
Command line: $ samtools faidx <reference> <region of interest>
Example code: $ samtools faidx ref.fa chr17:7668401-7687549 (for extracting TP53 sequence from ref.fa. Example ref.fa is hg19)
Make sure that fasta headers (chromosome/contigs) match while querying indexed fasta. If fasta file has chromosome name as chr17 (in the header), then use the same while extracting regions of interest. If fasta file has only chromosome number in header, change the query as per the chromosome header. The same would apply for
Example code:
1) $ samtools faidx ref.fa chr17:7668401-7687549 (If chromosome headers start with >chr17 heading)
2) $ samtools faidx ref.fa 17:7668401-7687549 (If chromosome headers start with >17 heading)
Following are the steps to extract a region of interest from huge fasta (with .fasta/.fa extension) file (in GNU-linux)
1) Index fasta file with samtools
Command line: $ samtools faidx <reference file>
Example: $ samtools faidx ref.fa
2) To view the indexed files, list the files in the directory. A new file with .fai extension should be available along with original file
For eg. If supplied reference file (ref.fa in above example) has .fa as extension, user should see a new file with .fai extension (example file that is created new: ref.fa.fai)
3) Extract the sequence of your interest with following command:
Command line: $ samtools faidx <reference> <region of interest>
Example code: $ samtools faidx ref.fa chr17:7668401-7687549 (for extracting TP53 sequence from ref.fa. Example ref.fa is hg19)
Make sure that fasta headers (chromosome/contigs) match while querying indexed fasta. If fasta file has chromosome name as chr17 (in the header), then use the same while extracting regions of interest. If fasta file has only chromosome number in header, change the query as per the chromosome header. The same would apply for
Example code:
1) $ samtools faidx ref.fa chr17:7668401-7687549 (If chromosome headers start with >chr17 heading)
2) $ samtools faidx ref.fa 17:7668401-7687549 (If chromosome headers start with >17 heading)
4) Count the sequences in fasta file: Let us assume that there are several fasta sequences in file and you would like to count the number of sequences within the file
Example fasta can be downloaded here. This fasta file has two sequences in it and is named as test2.fa
code: $ seqkit seq test2.fa -n | wc -l
this would print the number of fasta sequences in the file
5) Print the sequence names within a fasta file
code: $ seqkit seq test2.fa -n
This would print all the lines with >. This would equal to the number of sequences in fasta file
6) Count the bases in fasta file
code: $ faCount test2.fa
This would print number of A,T,G,C,N and CpG islands in a given sequence. For above example it would be:
#seq len A C G T N cpg
GQ118297.1 496 183 48 89 176 0 7
GQ118277.1 697 175 154 180 177 11 47
total 1193 358 202 269 353 11 54
faCount is part of UCSC kentutils
6) Average size of the sequence in a file with multiple fasta sequences
code: $ seqkit stats test2.fa
This would print statistics of the fasta file. For example above (test2.fa), output would be:
file format type num_seqs sum_len min_len avg_len max_len
test2.fa FASTA DNA 2 1,193 496 596.5 697
Same can be done with faSize from kentutils.
Code: $ faSize test2.fa
7) Search for a pattern in a fasta file
code: $ seqkit locate -idp TATATGA test2.fa
In the above example, i am searching for a small motif (TATATGA) in test2.fa.Code includes ignore case, degenerate bases and pattern options. Ignore case ignores the case, degenerate bases look for degenerate bases at any given position and pattern tells the program that this is the pattern (TATATGA) i am looking for.
8) Search for a patten in a fasta file and identify the gene in which the patten is located
code:
$ seqkit locate -i -p ccttctctgggccttgatttcccctcctgc chr12.fa --bed
| bedtools intersect -a - -b ../reference/chr12/genes_chr12.gtf -wb
| bedtools intersect -a - -b ../reference/chr12/genes_chr12.gtf -wb
This code is looking for a pattern in human chromosome 12 (fa file) in first partand in second part this code is intersecting gtf file with genes and exons obtained from UCSC and the pattern is (ccttctctgggccttgatttcccctcctgc). Output is:
chr12 186551 186581 ccttctctgggccttgatttcccctcctgc 0 + chr12 unknown exon
186542 186878 . + . gene_id "IQSEC3"; gene_name "IQSEC3"; p_id "P13619";
transcript_id "NM_015232"; tss_id "TSS12565"; (in a single line)9) Extract only first 10 bases from fasta
code: $ seqkit subseq -r 1:10 test2.fa