Miller is a good, if not great software for cleaning up data. It is available in most of the linux repos or you can compile it from it's github source. Miller developer(s) claim(s) that "miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON."
Well, whatever it is, i found miller is useful in one use case. There are times we find our data in following format:
Protein AAA
family FAA
length 300
Protein BBB
family GAA
length 240
family FAA
length 300
Protein BBB
family GAA
length 240
In above example each protein description is divided into multiple sections and each section has details about each protein such as family, length. In other cases, it might be much more detailed. We may want this information as below:
Protein family length
AAA FAA 300
AAA FAA 300
BBB GAA 240
Miller does this kind of transformation very easy. Miller calls former (above) format as xtab and later format as markdown.
Command and output for above input would be:
$ mlr --ixtab --omd cat test2.txt
| Protein | family | length |
| --- | --- | --- |
| AAA | FAA | 300 |
| BBB | GAA | 240 |
Let us see how we can use this format conversion in biology. Let us say you have worked with KEGG proteins/enzymes and you have data in following format:
=================================================================
ENTRY EC 1.1.1.1 Enzyme
NAME alcohol dehydrogenase;
CLASS Oxidoreductases;
SYSNAME alcohol:NAD+ oxidoreductase
REACTION (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917
R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE primary alcohol [CPD:C00226];
PRODUCT aldehyde [CPD:C00071];
ENTRY EC 1.1.1.157 Enzyme
NAME 3-hydroxybutyryl-CoA dehydrogenase;
CLASS Oxidoreductases;
SYSNAME (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+
[RN:R01976]
ALL_REAC R01976;
SUBSTRATE (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT 3-acetoacetyl-CoA [CPD:C00332];
===================================================================
Let us what we have and what we want:
1) This data has two entries EC 1.1.1.1 and EC 1.1.1.157.
2) Each entry has multiple descriptions such as name, class, reaction etc
*Please note ALL_REAC and REACTIO was in single line and since blogger template was
not wrapping it up nicely, i broke it into two lines.
What we want is
1) Remove enzyme entry
2) Convert this to tabular format in such a way the all the characteristics of protein/enzyme
are present as top rows and corresponding information is below each characteristic/
property. code for this is:
$ sed 's/\s\+/\t /;s/.*ENT/\n&/g;s/ /\t/g' test.txt | cut -f1,2 | mlr --ixtab --omd cat |
sed '2d;s/| //;s/\s*|\s*/\t/g'
Output would look like (when opened with libreoffice):
Now the characteristics of proteins are on first/top row and corresponding protein
information is below each row. Corresponding code is
mlr --ixtab --omd cat
and rest of the code is formatting with sed and others. If I am familiar with mlr much more, I would have done that part too with mlr. In above mlr command, input (ixtab) is xtab and output is (omd) is markdown and the output is printed on screen (cat).