Sometimes we need to search for a pattern that spans several lines. In this note, we do this for a pattern spanning two lines. Let us take an example.
This is a text file with chromosome number, start and stop coordinates, and strand information. The user would like to extract every line with + in the 4th column that is immediately preceded by a line with - in the 4th column, together with that preceding line.
In other words, the user wants every record on the negative (-) strand that is directly followed by a record on the positive (+) strand, along with that positive-strand record.
chr1 1275000 1284999 +
chr1 1285000 1294999 -
chr1 1295000 1304999 -
chr1 1385000 1394999 -
chr1 1415000 1424999 -
chr1 1425000 1434999 +
chr1 1435000 1444999 +
chr1 1715000 1724999 +
chr1 1725000 1734999 -
chr1 1735000 1744999 -
chr1 1745000 1754999 -
chr1 1795000 1804999 -
chr1 1805000 1814999 +
chr1 1815000 1824999 -
chr1 1865000 1874999 -
Output should be:
chr1 1415000 1424999 -
chr1 1425000 1434999 +
chr1 1795000 1804999 -
chr1 1805000 1814999 +
Now let us do this the easiest way, using pcregrep in the shell:
===================================
$ pcregrep -M '\-$\n.*\+$' test1.txt
chr1 1415000 1424999 -
chr1 1425000 1434999 +
chr1 1795000 1804999 -
chr1 1805000 1814999 +
===================================
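pcregrep can match a pattern that spans more than one line when given the -M (multiline) option. Here the pattern matches a - at the end of one line, the newline, and then a following line that ends in +.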
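Now let us do this using plain grep in the shell. The first grep prints every line containing - plus the line after it (-A 1); the second grep keeps only the lines containing + plus the line before each of them (-B 1):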
===================================
$ grep -A 1 "-" test1.txt | grep --no-group-separator -B 1 "+"
chr1 1415000 1424999 -
chr1 1425000 1434999 +
chr1 1795000 1804999 -
chr1 1805000 1814999 +
=================================================
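The two grep patterns above match - and + anywhere on a line. That is enough for this file, but it could misfire if either character appeared in another column. Below is a minimal sketch of a stricter variant. It assumes GNU grep (needed for --no-group-separator) and that the strand is the last field with no trailing whitespace. The function name minus_then_plus_grep is hypothetical, chosen only for this illustration.

```shell
# Hypothetical helper (not from the original note): same two-step grep idea,
# but with both patterns anchored to the end of the line.
minus_then_plus_grep() {
    # Step 1: every line ending in "-" plus the line after it.
    # Step 2: from that, every line ending in "+" plus the line before it.
    grep -A 1 -e '-$' "$@" | grep --no-group-separator -B 1 -e '+$'
}

# Small sample in the same format as the data above, read from stdin.
printf '%s\n' \
    'chr1 1385000 1394999 -' \
    'chr1 1415000 1424999 -' \
    'chr1 1425000 1434999 +' \
    'chr1 1435000 1444999 +' \
    'chr1 1815000 1824999 -' | minus_then_plus_grep
```

For this sample the function prints the 1415000 and 1425000 records. A file name can be passed instead of piping, for example minus_then_plus_grep test1.txt.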
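Now let us do this in R:
=================================================
> # Read the tab-separated file; the columns are named V1 to V4
> df1 <- read.csv("test1.txt", sep = "\t", stringsAsFactors = FALSE, strip.white = TRUE, header = FALSE)
> # Rows that sit immediately before each "+" row
> df2 <- df1[grep("\\+", df1$V4) - 1, ]
> # Of those, keep only the rows on the "-" strand
> df3 <- df2[grep("\\-", df2$V4), ]
> # Print each such "-" row together with the "+" row that follows it
> df1[sort(c(as.integer(row.names(df3)), as.integer(row.names(df3)) + 1)), ]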
     V1      V2      V3 V4
5  chr1 1415000 1424999  -
6  chr1 1425000 1434999  +
12 chr1 1795000 1804999  -
13 chr1 1805000 1814999  +
==============================================
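Now let us do this in awk. awk remembers the 4th column of the previous line in s and prints every + line whose previous line had - there; grep then reads those lines as patterns (-f -) and prints each match in the file together with the line before it (-B 1):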
===============================================
$ awk '$4 == "+" && s == "-" {print} {s = $4}' test1.txt | grep -F --no-group-separator -f - -B 1 test1.txt
chr1 1415000 1424999 -
chr1 1425000 1434999 +
chr1 1795000 1804999 -
chr1 1805000 1814999 +
===============================================
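The awk and grep pipeline above reads the file twice. The same result can probably be had in a single awk pass by remembering the whole previous line as well as its strand. This is a minimal sketch, not the method used above. The function name minus_then_plus and the variable names prev_strand and prev_line are hypothetical, chosen only for this illustration.

```shell
# Hypothetical helper (not from the original note): print each "-" record
# that is immediately followed by a "+" record, then that "+" record.
minus_then_plus() {
    awk '
        # Current row is "+" and the previous row was "-": print the pair.
        $4 == "+" && prev_strand == "-" { print prev_line; print $0 }
        # Remember this row so the next row can be compared against it.
        { prev_strand = $4; prev_line = $0 }
    ' "$@"
}

# Small sample in the same format as the data above, read from stdin.
printf '%s\n' \
    'chr1 1385000 1394999 -' \
    'chr1 1415000 1424999 -' \
    'chr1 1425000 1434999 +' \
    'chr1 1435000 1444999 +' \
    'chr1 1815000 1824999 -' | minus_then_plus
```

For this sample the function prints the 1415000 and 1425000 records. Called as minus_then_plus test1.txt on the file from this note, it should print the same four lines as the other approaches. Because it compares the 4th field exactly, it does not depend on where else + or - might appear on a line.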