Grep search strings with line breaks

Tags: grep, bash

How can I use grep to output occurrences of the string 'export to excel' in the input files given below? Specifically, how do I handle line breaks that fall in the middle of the search string? Is there a switch in grep that can do this, or some other command?

Input files:

File a.txt:

blah blah ... export to
excel ...
blah blah..

File b.txt:

blah blah ... export to excel ...
blah blah..

Vijay Dev asked Dec 07 '09 06:12



2 Answers

Do you just want to find files that contain the pattern, ignoring linebreaks, or do you want to actually see the matching lines?

If the former, you can use tr to convert newlines to spaces:

tr '\n' ' ' < a.txt | grep 'export to excel'

If the latter, you can do the same thing, but because tr turns the whole file into one long line, grep would print the entire file. Use the -o flag to print only the actual match, and widen your regex to include any extra context you want.
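
For comparison, here is a minimal Python sketch of the same two modes, assuming the whole file fits in memory. The helper names `collapse_newlines`, `contains_phrase` and `matches_with_context` are made up for illustration. The first check mimics the plain tr-plus-grep test; the second mimics grep -o with a regex widened to keep up to `width` characters of context on each side.

```python
import re

def collapse_newlines(text):
    # Same effect as tr '\n' ' ': every newline becomes a single space.
    return text.replace("\n", " ")

def contains_phrase(text, phrase="export to excel"):
    # Former case: only report whether the phrase occurs at all.
    return phrase in collapse_newlines(text)

def matches_with_context(text, phrase="export to excel", width=10):
    # Latter case: like grep -o with a widened regex, return each match
    # plus up to `width` characters of context on either side.
    pat = re.compile(".{0,%d}%s.{0,%d}" % (width, re.escape(phrase), width))
    return pat.findall(collapse_newlines(text))

# Contents of a.txt from the question.
a_txt = "blah blah ... export to\nexcel ...\nblah blah..\n"
print(contains_phrase(a_txt))
print(matches_with_context(a_txt, width=4))
```

Like the tr pipeline, this only matches when the line break stands in for exactly one space; a line that already ends in a space would leave two spaces after collapsing, so a pattern such as `export\s+to\s+excel` may be the safer choice.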

Laurence Gonsalves answered Oct 07 '22 03:10


I don't know how to do this in grep. I checked the man page for egrep(1) and it can't match with a newline in the middle either.

I like the solution @Laurence Gonsalves suggested, of using tr(1) to wipe out the newlines. But as he noted, it will be a pain to print the matching lines if you do it that way.

If you want to match despite a newline and then print the matching line(s), I can't think of a way to do it with grep, but it would not be too hard in any of Python, AWK, Perl, or Ruby.

Here's a Python script that solves the problem. I decided that, for lines that only match when joined to the previous line, I would print a --> arrow before the second line of the match. Lines that match outright are always printed without the arrow.

This is written assuming that /usr/bin/python is Python 2.x. You can trivially change the script to work under Python 3.x if desired.

#!/usr/bin/python

import re
import sys

s_pat = r"export\s+to\s+excel"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        f = open(fname, "rt")
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)

    prev_line = ""
    i_last = -10
    for i, line in enumerate(f):
        # is ete within current line?
        if pat.search(line):
            print("%s:%d: %s" % (fname, i+1, line.strip()))
            i_last = i
        else:
            # construct extended line that includes the previous line
            # note the previous line's newline is stripped
            s = prev_line.strip("\n") + " " + line
            # is ete within extended line?
            if pat.search(s):
                # matched ete in extended so want both lines printed
                # did we print prev line?
                if not i_last == (i - 1):
                    # no so print it now
                    print("%s:%d: %s" % (fname, i, prev_line.strip()))
                # print cur line with special marker
                print("-->  %s:%d: %s" % (fname, i+1, line.strip()))
                i_last = i
        # make sure we don't match ete twice
        prev_line = re.sub(pat, "", line)

    f.close()

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
            "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])

EDIT: added comments.

I went to some trouble to make it print the correct line number on each line, using a format similar to what you would get with grep -Hn.

It could be much shorter and simpler if you don't need line numbers, and you don't mind reading in the whole file at once into memory:

#!/usr/bin/python

import re
import sys

# \s matches any whitespace, including "\n", regardless of regex flags.
# We *want* that here so the pattern can match across a line break.
# Note the match group that gathers text around the ete pattern uses a
# character class that matches anything but "\n", so it grabs only the
# text on the same line(s) as the match.
s_pat = r"([^\n]*export\s+to\s+excel[^\n]*)"
pat = re.compile(s_pat)

def print_ete(fname):
    try:
        # read the whole file at once; the with block closes it afterwards
        with open(fname, "rt") as f:
            text = f.read()
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)

    for s_match in pat.findall(text):
        print(s_match)

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
            "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])
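
If you like the read-everything approach but still want line numbers, one possible sketch is to count the newlines that precede each match. The function name `find_with_line_numbers` and the sample strings are assumptions made for illustration, not part of the scripts above.

```python
import re

# Hypothetical helper; PAT is the same ete pattern without the context groups.
PAT = re.compile(r"export\s+to\s+excel")

def find_with_line_numbers(text):
    """Return (first_line, last_line, matched_text) for every match.

    Line numbers are 1-based; a match that spans a line break has
    last_line > first_line.
    """
    results = []
    for m in PAT.finditer(text):
        # Newlines before the match start give the 0-based line index.
        first = text.count("\n", 0, m.start()) + 1
        # Newlines inside the match tell us how many lines it spans.
        last = first + m.group(0).count("\n")
        results.append((first, last, m.group(0)))
    return results

# Contents of a.txt and b.txt from the question.
a_txt = "blah blah ... export to\nexcel ...\nblah blah..\n"
b_txt = "blah blah ... export to excel ...\nblah blah..\n"
for name, text in (("a.txt", a_txt), ("b.txt", b_txt)):
    for first, last, matched in find_with_line_numbers(text):
        print("%s:%d-%d: %r" % (name, first, last, matched))
```

Counting from the start of the text for every match is simple but rescans the file each time; for very large files with many matches, keeping a running newline count would be cheaper.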

steveha answered Oct 07 '22 01:10