Biologist's bioinformatics notes

In one of my earlier posts, I showed an example of zero-length assertions in R. In this note, we have examples of zero-length assertions (lookaheads) in bash and Python 3. Let us take an example file first and then run the code.
===============================================
Protein of unknown function (DUF1466)    RCS1    #N/A
Ras family    Small GTPase superfamily    GO:0003924|GO:0005525
Sugar-tranasporters, 12 TM    Molybdate-anion transporter    GO:0015098|GO:0015689|GO:0016021
Sodium:dicarboxylate symporter family    Sodium:dicarboxylate symporter    GO:0015293|GO:0016021
mTERF    Transcription termination factor, mitochondrial/chloroplastic    GO:0003690|GO:0006355
==============================================
Now the user wants everything before the GO terms and #N/A. The easier way is to cut the last column out with awk or bash. However, let us use a lookahead to get what we want. Let us start with bash (the file name is lookaround.txt and its columns are tab-separated):
====================================
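$ grep -Po '.*(?=\t(?:GO:|#N/A))' lookaround.txt
====================================
This code matches everything up to, but not including, a tab that is followed by "GO:" or "#N/A". The lookahead only checks what comes next and consumes no characters, so the tab and the last column stay out of the match. Note that the alternatives must sit in a group, (?:GO:|#N/A); square brackets would define a character class that matches a single character instead. This would print: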
====================================
Protein of unknown function (DUF1466)    RCS1
Ras family    Small GTPase superfamily
Sugar-tranasporters, 12 TM    Molybdate-anion transporter
Sodium:dicarboxylate symporter family    Sodium:dicarboxylate symporter
mTERF    Transcription termination factor, mitochondrial/chloroplastic
====================================
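The difference between a character class and a grouped alternation inside the lookahead is easy to miss, because both give the same result on the example file. The short Python sketch below compares the two on one line from the file and on one made-up line that has no annotation column. The names CHAR_CLASS, ALTERNATION and trim, and the line "mTERF<tab>Nuclear factor", are assumptions for illustration only.

```python
import re

# Character class: after the tab, ANY ONE of the characters G, O, |, #, N, A.
CHAR_CLASS = re.compile(r'.*(?=\t[GO|#NA])')
# Grouped alternation: after the tab, the literal text "GO:" or "#N/A".
ALTERNATION = re.compile(r'.*(?=\t(?:GO:|#N/A))')


def trim(pattern, line):
    """Return the text before the lookahead position, or None if no match."""
    m = pattern.search(line)
    return m.group(0) if m else None


# A line taken from the example file (columns are tab-separated).
real_line = "Ras family\tSmall GTPase superfamily\tGO:0003924|GO:0005525"
# A hypothetical line without an annotation column.
no_annotation = "mTERF\tNuclear factor"

for name, pattern in (("class", CHAR_CLASS), ("group", ALTERNATION)):
    print(name, repr(trim(pattern, real_line)), repr(trim(pattern, no_annotation)))
```

On the real line both patterns return the first two columns. On the made-up line the character class still matches, because "Nuclear" starts with N, and silently drops the second column, while the grouped alternation reports no match.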
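Let us do it in Python 3 (mind your indents):
===============================
import re

# Everything before a tab that is followed by "GO:" or "#N/A"
pattern = re.compile(r'.*(?=\t(?:GO:|#N/A))')
with open("lookaround.txt", "r") as f:
    test = f.readlines()
# Keep only the lines where the lookahead matches
bout = [m.group(0) for m in map(pattern.search, test) if m]
print(*bout, sep='\n')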
======================================
The output is:
===========================================
Protein of unknown function (DUF1466)    RCS1
Ras family    Small GTPase superfamily
Sugar-tranasporters, 12 TM    Molybdate-anion transporter
Sodium:dicarboxylate symporter family    Sodium:dicarboxylate symporter
mTERF    Transcription termination factor, mitochondrial/chloroplastic
==========================================
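For comparison, the simpler route mentioned earlier (cutting off the last column) can also be sketched in Python without any regular expression. This is only a minimal sketch, assuming the columns are tab-separated and the annotation is always the last column. The function name drop_last_column is made up for illustration, and the sample lines are copied from the example file.

```python
def drop_last_column(line):
    """Return everything before the final tab-separated field.

    Assumes tab-separated columns; a line without any tab yields "".
    """
    head, _sep, _last = line.rstrip("\n").rpartition("\t")
    return head


# Two lines from the example file, written with explicit tabs.
sample = [
    "Protein of unknown function (DUF1466)\tRCS1\t#N/A\n",
    "Ras family\tSmall GTPase superfamily\tGO:0003924|GO:0005525\n",
]

for line in sample:
    print(drop_last_column(line))
```

Unlike the lookahead, this version does not check what the last column contains, so it would also strip a last column that is neither a GO term nor #N/A.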

Aug 4, 2018 - Zero length assertions in bash and python 3