Biologist's bioinformatics notes

Sometimes we need to search for a pattern that spans several lines. In this note, we do this for a pattern spanning two lines. Let us take an example.

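This is a tab-separated text file (test1.txt) with chromosome number, start and stop coordinates, and strand information. The user would like to extract every line with + in the 4th column that is immediately preceded by a line with - in the 4th column, together with that preceding line.

In other words, the user wants each record on the negative (-) strand that is directly followed by a record on the positive (+) strand, plus the positive-strand record that follows it.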

chr1    1275000 1284999 +
chr1    1285000 1294999 -
chr1    1295000 1304999 -
chr1    1385000 1394999 -
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1435000 1444999 +
chr1    1715000 1724999 +
chr1    1725000 1734999 -
chr1    1735000 1744999 -
chr1    1745000 1754999 -
chr1    1795000 1804999 -
chr1    1805000 1814999 +
chr1    1815000 1824999 -
chr1    1865000 1874999 -

The output should be:
chr1 1415000 1424999 -
chr1 1425000 1434999 +
chr1 1795000 1804999 -
chr1 1805000 1814999 +

Now let us do this the easiest way, using pcregrep in the shell:
===================================
$ pcregrep -M '\-$\n.*\+$' test1.txt
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +
===================================
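pcregrep allows the user to grep for a multiline pattern using the -M option. Here the pattern matches a line ending in -, the newline, and then a whole line ending in +.
Now let us do this using grep in the shell: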
===================================
$ grep -A 1 "-" test1.txt | grep --no-group-separator -B 1 "+"
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +
=================================================
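The grep pipeline above matches - and + anywhere on the line, which is fine for this file but would give false hits if another column ever contained a dash (for example, a contig name). A minimal sketch of a stricter variant follows, assuming GNU grep and that the strand is the last field on the line with no trailing whitespace. The function name strand_switch_grep and the sample contig name chrUn-1 are made up for illustration and are not from the original note.

```shell
# Hypothetical helper: same two-step grep idea, but anchored to the end of the line.
# -e is used so that a pattern starting with "-" is not parsed as an option.
strand_switch_grep() {
    grep -A 1 -e '-$' "$1" | grep --no-group-separator -B 1 -e '+$'
}

# Made-up sample: the contig name contains a dash, the strand is in column 4.
sample=$(mktemp)
printf 'chrUn-1\t100\t199\t+\nchrUn-1\t200\t299\t+\nchrUn-1\t300\t399\t-\nchrUn-1\t400\t499\t+\n' > "$sample"

# Prints only the "-" record and the "+" record that directly follows it.
strand_switch_grep "$sample"
```

With the unanchored patterns, every line of this sample would be reported because the dash in the contig name matches on each line.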
Now let us do this in R:
=================================================
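> # Read the four-column table (no header line)
> df1=read.csv("test1.txt", sep="\t", stringsAsFactors = F, strip.white = T, header = F)
> # Rows that sit immediately before each "+" row
> df2=df1[grep("\\+", df1$V4,value = F)-1,]
> # Of those, keep only the rows on the "-" strand
> df3=df2[grep("\\-", df2$V4,value = F),]
> # Print each such "-" row together with the "+" row after it
> df1[sort(c(as.integer(row.names(df3)),as.integer(row.names(df3))+1)),]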
     V1      V2      V3 V4
5  chr1 1415000 1424999  -
6  chr1 1425000 1434999  +
12 chr1 1795000 1804999  -
13 chr1 1805000 1814999  +
==============================================
Now let us do this in awk:
===============================================
$ awk '$4 == "+" && s == "-" {print $0} {s = $4}' test1.txt | grep --no-group-separator -f - -B 1 test1.txt
chr1    1415000 1424999 -
chr1    1425000 1434999 +
chr1    1795000 1804999 -
chr1    1805000 1814999 +
===============================================
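The awk command above only finds the + lines and then hands them to grep to pull in the preceding line. As a rough sketch, the same result can be had in a single awk pass by remembering the previous line as well as its strand. The function name strand_switch and the variables prev_strand and prev_line are hypothetical names chosen for this example, and the sample file is made up in the same four-column layout.

```shell
# Hypothetical helper: print every "-" record that is immediately
# followed by a "+" record, together with that "+" record.
strand_switch() {
    awk '$4 == "+" && prev_strand == "-" { print prev_line; print }
         { prev_strand = $4; prev_line = $0 }' "$1"
}

# Made-up sample in the same layout as test1.txt (tab-separated, strand in column 4).
sample=$(mktemp)
printf 'chr1\t100\t199\t+\nchr1\t200\t299\t-\nchr1\t300\t399\t+\nchr1\t400\t499\t-\n' > "$sample"

# Prints the 200-299 "-" record and the 300-399 "+" record.
strand_switch "$sample"
```

Because nothing is passed to a second grep, this version does not depend on the GNU-only --no-group-separator option, and it compares column 4 exactly instead of searching the whole line.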

Sep 1, 2018 - Multiline pattern search using R, grep, pcregrep and awk