In one of my earlier posts, I gave an example of zero-length assertions in R. In this note, we have examples of zero-length assertions (lookaheads) in bash and Python 3. Let us take an example file first and then run the code.
===============================================
Protein of unknown function (DUF1466) RCS1 #N/A
Ras family Small GTPase superfamily GO:0003924|GO:0005525
Sugar-tranasporters, 12 TM Molybdate-anion transporter GO:0015098|GO:0015689|GO:0016021
Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter GO:0015293|GO:0016021
mTERF Transcription termination factor, mitochondrial/chloroplastic GO:0003690|GO:0006355
==============================================
Now the user wants everything before the GO terms and #N/A. The easier way is to cut the last column out with awk or bash. However, let us use a lookahead to get what we want. Let us start with bash (the file is tab-separated and is named lookaround.txt):
====================================
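$ grep -Po '.*(?=\t(?:GO:|#N/A))' lookaround.txt
====================================
This pattern matches everything up to, but not including, a tab that is followed by "GO:" or "#N/A". The lookahead (?=...) only checks that this text comes next without consuming it, and -o prints just the matched part. Note that the alternatives go in a group, (?:GO:|#N/A), and not in square brackets: [GO|#NA] would be a character class that matches a single character. This would print: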
====================================
Protein of unknown function (DUF1466) RCS1
Ras family Small GTPase superfamily
Sugar-tranasporters, 12 TM Molybdate-anion transporter
Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter
mTERF Transcription termination factor, mitochondrial/chloroplastic
====================================
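The difference between a character class and a group is easy to miss, because on this particular file both give the same result. Here is a minimal Python sketch of the difference. The two sample strings and the helper `matched_text` are made up for illustration and are not part of lookaround.txt.

```python
import re

# Hypothetical sample strings (not from lookaround.txt); "\t" is a tab.
good = "Ras family\tGO:0003924"
odd = "Ras family\tNot annotated"

# Character class: a tab followed by ONE character out of G, O, |, #, N, A
char_class = re.compile(r".*(?=\t[GO|#NA])")
# Group with alternation: a tab followed by the literal text "GO" or "#N/A"
alternation = re.compile(r".*(?=\t(?:GO|#N/A))")

def matched_text(pattern, line):
    """Return the matched prefix, or None when the lookahead never succeeds."""
    m = pattern.search(line)
    return m.group(0) if m else None

print(matched_text(char_class, good))   # both patterns accept the GO line
print(matched_text(alternation, good))
print(matched_text(char_class, odd))    # accepted only because "N" is in the class
print(matched_text(alternation, odd))   # None: "Not" is neither "GO" nor "#N/A"
```

So the character class also accepts a column that merely starts with N, A, G or O, while the group insists on the full text.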
Let us do the same in Python 3 (mind your indents):
===============================
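import re

# Read the tab-separated file line by line
with open("lookaround.txt", "r") as f:
    test = f.readlines()

# Keep everything before the tab that is followed by "GO:" or "#N/A"
bout = [re.search(r'.*(?=\t(?:GO:|#N/A))', i).group(0) for i in test]
print(*bout, sep='\n')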
======================================
The output is:
===========================================
Protein of unknown function (DUF1466) RCS1
Ras family Small GTPase superfamily
Sugar-tranasporters, 12 TM Molybdate-anion transporter
Sodium:dicarboxylate symporter family Sodium:dicarboxylate symporter
mTERF Transcription termination factor, mitochondrial/chloroplastic
==========================================
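For comparison, here is a rough sketch of the "easier way" mentioned at the start (dropping the last column) next to the lookahead, with a guard for lines where nothing matches. The in-memory `lines` list stands in for lookaround.txt, the exact tab positions in it are an assumption, and the function names `drop_last_column` and `before_annotation` are made up for this example.

```python
import re

# Hypothetical stand-in for lookaround.txt; tab positions are assumed.
lines = [
    "Protein of unknown function (DUF1466)\tRCS1\t#N/A\n",
    "Ras family\tSmall GTPase superfamily\tGO:0003924|GO:0005525\n",
]

def drop_last_column(line):
    """Remove the final tab-separated field without using a regex."""
    return line.rstrip("\n").rsplit("\t", 1)[0]

def before_annotation(line):
    """Lookahead version; returns None when no annotation column is found."""
    m = re.search(r".*(?=\t(?:GO:|#N/A))", line)
    return m.group(0) if m else None

by_split = [drop_last_column(line) for line in lines]
by_regex = [before_annotation(line) for line in lines]

# Both approaches should agree on these sample lines
print(by_split == by_regex)
print(*by_split, sep="\n")
```

The split version does not care what the last column contains, while the lookahead version keeps a line only if the last column really starts with "GO:" or "#N/A". Returning None instead of calling .group(0) directly avoids an AttributeError on lines without an annotation column.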