Biologist's bioinformatics notes

Miller is a good, if not great software for cleaning up data. It is available in most of the linux repos or you can compile it from it's github source. Miller developer(s) claim(s) that "miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON."

Well, whatever it is, i found miller is useful in one use case. There are times we find our data in following format:

Protein AAA
family FAA
length 300

Protein BBB
family GAA
length 240

In above example each protein description is divided into multiple sections and each section has details about each protein such as family, length. In other cases, it might be much more detailed. We may want this information as below:

Protein family length

AAA FAA 300

BBB GAA 240

Miller does this kind of transformation very easy. Miller calls former (above) format as xtab and later format as markdown.

Command and output for above input would be:

$ mlr --ixtab --omd cat test2.txt

| Protein | family | length |
| --- | --- | --- |
| AAA | FAA | 300 |
| BBB | GAA | 240 |

Let us see how we can use this format conversion in biology. Let us say you have worked with KEGG proteins/enzymes and you have data in following format:

=================================================================

ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     alcohol:NAD+ oxidoreductase
REACTION    (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC    R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 
                      R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE   primary alcohol [CPD:C00226];
PRODUCT     aldehyde [CPD:C00071];
ENTRY       EC 1.1.1.157                Enzyme
NAME        3-hydroxybutyryl-CoA dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION    (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ 
                      [RN:R01976]
ALL_REAC    R01976;
SUBSTRATE   (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT     3-acetoacetyl-CoA [CPD:C00332];

===================================================================

Let us what we have and what we want:

1) This data has two entries EC 1.1.1.1 and EC 1.1.1.157.

2) Each entry has multiple descriptions such as name, class, reaction etc

*Please note ALL_REAC and REACTIO was in single line and since blogger template was

not wrapping it up nicely, i broke it into two lines.

What we want is

1) Remove enzyme entry

2) Convert this to tabular format in such a way the all the characteristics of protein/enzyme 
are present as top rows and corresponding information is below each characteristic/
property. code for this is:

$ sed 's/\s\+/\t /;s/.*ENT/\n&/g;s/          /\t/g' test.txt | cut -f1,2 | mlr --ixtab --omd cat |

sed '2d;s/| //;s/\s*|\s*/\t/g'

Output would look like (when opened with libreoffice):

Now the characteristics of proteins are on first/top row and corresponding protein

 information is below each row. Corresponding code is  mlr --ixtab --omd cat and 
rest of the code is formatting with sed and others. If I am familiar with mlr much more, 
I would have done that part too with mlr. In above mlr command, input (ixtab) is xtab 
and output is (omd) is markdown and the output is printed on screen (cat).

Recent Posts

Links

Nov 19, 2017 - Parse KEGG output with miller