Biologist's bioinformatics notes

Researchers working with NGS data store aligned reads in a format called SAM; its binary counterpart is BAM, and the two are interconvertible. A SAM file carries several kinds of information; for reference, see the SAM format specification (PDF). One feature of the SAM format is that records can be annotated with optional tags. Sometimes researchers need to extract these tags and store them. One question I came across was how to extract two tags (BS and BQ, which in the example below hold a base sequence and its quality string, respectively) and store them in FASTQ format. Indirectly, this is equivalent to converting a SAM record to a FASTQ record.

Let us do this in the shell. One reason I prefer scripts to compiled programs is that scripts are easy to understand and easy to fix. The downside is that they are slow and often neither portable across platforms nor backward compatible.

Example of a SAM record:

m54071222/4194368/0_197 4 * 0 255 * * 0 0 AAGAGGAAGGGGGAGAGAGAGGAGGAGAGGGGGGAAGAGGTTGGGATGGAAAATAGGTGGTTAGAGGGAGAAAGG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! np:i:1 qe:i:197 qs:i:0 rq:f:0 BS:Z:ATGGCCAATTGCAGAA BQ:Z:JJJKKKKKKKKLLLLL zm:i:4194368 RG:Z:1f1bf15c sc:A:L sz:A:N

You can download the example SAM file with one record here. Note that tags can appear in any order. For readability, the command is split across multiple lines:

============================================================

$ paste <(awk '{print $1}' test.sam) \
  <(grep -wo "\<BS\W\w\W\w\+\>" test.sam) \
  <(grep -wo "\<BQ\W\w\W\w\+\>" test.sam) | sed 's/:/\t/g' | \
  awk -v OFS="\n" -v ORS="\n\n" '{print "@"$1,$4,"+",$7}'

output:

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL

@m54071222/4194368/0_197
ATGGCCAATTGCAGAA
+
JJJKKKKKKKKLLLLL
================================================================

Several things are going on in this command. Let me explain them here:

The script has three steps: paste, replacement by sed, and output formatting by awk.
The paste command joins, line by line, the output of one awk command and two grep commands.
The grep commands pick out the tags (BS and BQ) using the whole-word (-w) and only-matching (-o) options, one grep per tag.
The first awk command prints the first column (the read ID).
The sed command splits the tags at each colon, so every tag becomes three pieces (BS:Z:ATGGCCAATTGCAGAA becomes BS, Z and ATGGCCAATTGCAGAA).
The final awk command extracts the needed columns: it prepends @ to the ID, prints the BS sequence, prints +, and finally prints the BQ string. Fields are separated by newlines (OFS), so each SAM line becomes a four-line FASTQ record, and records are separated by a blank line (ORS is two newlines). The blank line is there only to make the output easier to read; set ORS to a single newline to remove it.
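The same steps can also be sketched as a single awk pass. This is only an illustration, not the command used above: the function name sam_tags_to_fastq is hypothetical, and the sketch assumes a tab-delimited SAM file in which optional tags start at column 12. Because it compares whole fields instead of matching \w\+, it keeps quality strings that contain punctuation or colons, which the grep and sed steps above would cut short or split.

```shell
# Hypothetical helper: build FASTQ records from the BS and BQ tags of a SAM file.
# Assumes tab-delimited SAM; records lacking either tag are skipped.
sam_tags_to_fastq() {
  awk -F'\t' '
    /^@/ { next }                      # skip SAM header lines
    {
      seq = ""; qual = ""
      for (i = 12; i <= NF; i++) {     # optional tags start at column 12
        if ($i ~ /^BS:Z:/) seq  = substr($i, 6)   # drop the "BS:Z:" prefix
        if ($i ~ /^BQ:Z:/) qual = substr($i, 6)   # drop the "BQ:Z:" prefix
      }
      if (seq != "" && qual != "")
        printf "@%s\n%s\n+\n%s\n", $1, seq, qual  # one four-line FASTQ record
    }' "$1"
}
```

It could be called as sam_tags_to_fastq test.sam > test.fastq. The output has no blank line between records.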

I think this is the logic behind a sam2fastq function, should you ever want to convert a SAM file to FASTQ format using any of its fields. Remember, however, that this works for single-end reads, not for paired-end sequencing. Note that when you convert a SAM record to FASTQ, the result must adhere to the FASTQ format standard so that downstream tools can handle the files properly.
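As a rough way to check that a converted file looks like FASTQ, something along these lines might help. It is a minimal sketch, not a full validator: the function name check_fastq is hypothetical, and it assumes strict four-line records with no blank lines between them (so the blank separator lines in the output above would have to be removed first). It only checks that the header starts with @, the separator starts with +, and the sequence and quality lines have equal length.

```shell
# Hypothetical sanity check for a strict four-line-per-record FASTQ file.
# Prints the first problem found and returns a non-zero status.
check_fastq() {
  awk '
    BEGIN { bad = 0 }
    NR % 4 == 1 && !/^@/  { print "line " NR ": header must start with @"; bad = 1; exit }
    NR % 4 == 2           { n = length($0) }   # remember the sequence length
    NR % 4 == 3 && !/^\+/ { print "line " NR ": separator must start with +"; bad = 1; exit }
    NR % 4 == 0 && length($0) != n {
      print "line " NR ": quality length differs from sequence length"; bad = 1; exit
    }
    END {
      if (!bad && NR % 4 != 0) { print "incomplete final record"; bad = 1 }
      exit bad
    }' "$1"
}
```

For example, check_fastq test.fastq && echo "looks OK" prints the message only when every record passes these checks.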

Sep 2, 2018 - Sam to fastq conversion and tag extraction