Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get lines before and after matching from java 8 stream like grep?

I have a text files that have a lot of string lines in there. If I want to find lines before and after a matching in grep, I will do like this:

grep -A 10 -B 10 "ABC" myfile.txt

How can I implements the equivalent in Java 8 using stream?

like image 410
Sean Nguyen Avatar asked Oct 04 '15 06:10

Sean Nguyen


3 Answers

If you're willing to use a third party library and don't need parallelism, then jOOλ offers SQL-style window functions as follows

Seq.seq(Files.readAllLines(Paths.get(new File("/path/to/Example.java").toURI())))
   .window(-1, 1)
   .filter(w -> w.value().contains("ABC"))
   .forEach(w -> {
       System.out.println("-1:" + w.lag().orElse(""));
       System.out.println(" 0:" + w.value());
       System.out.println("+1:" + w.lead().orElse(""));
       // ABC: Just checking
   });

Yielding

-1:       .window(-1, 1)
 0:       .filter(w -> w.value().contains("ABC"))
+1:       .forEach(w -> {
-1:           System.out.println("+1:" + w.lead().orElse(""));
 0:           // ABC: Just checking
+1:       });

The lead() function accesses the next value in traversal order from the window, the lag() function accesses the previous row.

Disclaimer: I work for the company behind jOOλ

like image 195
Lukas Eder Avatar answered Nov 04 '22 15:11

Lukas Eder


Such scenario is not well-supported by Stream API as existing methods do not provide an access to the element neighbors in the stream. The closest solution which I can think up without creating custom iterators/spliterators and third-party library calls is to read the input file into List and then use indices Stream:

List<String> input = Files.readAllLines(Paths.get(fileName));
Predicate<String> pred = str -> str.contains("ABC");
int contextLength = 10;

IntStream.range(0, input.size()) // line numbers
    // filter them leaving only numbers of lines satisfying the predicate
    .filter(idx -> pred.test(input.get(idx))) 
    // add nearby numbers
    .flatMap(idx -> IntStream.rangeClosed(idx-contextLength, idx+contextLength))
    // remove numbers which are out of the input range
    .filter(idx -> idx >= 0 && idx < input.size())
    // sort numbers and remove duplicates
    .distinct().sorted()
    // map to the lines themselves
    .mapToObj(input::get)
    // output
    .forEachOrdered(System.out::println);

The grep output also includes special delimiter like "--" to designate the omitted lines. If you want to go further and mimic such behavior as well, I can suggest you to try my free StreamEx library as it has intervalMap method which is helpful in this case:

// Same as IntStream.range(...).filter(...) steps above
IntStreamEx.ofIndices(input, pred)
    // same as above
    .flatMap(idx -> IntStream.rangeClosed(idx-contextLength, idx+contextLength))
    // remove numbers which are out of the input range
    .atLeast(0).less(input.size())
    // sort numbers and remove duplicates
    .distinct().sorted()
    .boxed()
    // merge adjacent numbers into single interval and map them to subList
    .intervalMap((i, j) -> (j - i) == 1, (i, j) -> input.subList(i, j + 1))
    // flatten all subLists prepending them with "--"
    .flatMap(list -> StreamEx.of(list).prepend("--"))
    // skipping first "--"
    .skip(1)
    .forEachOrdered(System.out::println);
like image 38
Tagir Valeev Avatar answered Nov 04 '22 13:11

Tagir Valeev


As Tagir Valeev noted, this kind of problem isn't well supported by the streams API. If you incrementally want to read lines from the input and print out matching lines with context, you'd have to introduce a stateful pipeline stage (or a custom collector or spliterator) which adds quite a bit of complexity.

If you're willing to read all the lines into memory, it turns out that BitSet is a useful representation for manipulating groups of matches. This bears some similarity to Tagir's solution, but instead of using integer ranges to represent lines to be printed, it uses 1-bits in a BitSet. Some advantages of BitSet are that it has a number of built-in bulk operations, and it has a compact internal representation. It can also produce a stream of indexes of the 1-bits, which is quite useful for this problem.

First, let's start out by creating a BitSet that has a 1-bit for each line that matches the predicate:

void contextMatch(Predicate<String> pred, int before, int after, List<String> input) {
    int len = input.size();
    BitSet matches = IntStream.range(0, len)
                              .filter(i -> pred.test(input.get(i)))
                              .collect(BitSet::new, BitSet::set, BitSet::or);

Now that we have the bit set of matching lines, we stream out the indexes of each 1-bit. We then set the bits in the bitset that represent the before and after context. This gives us a single BitSet whose 1-bits represent all of the lines to be printed, including context lines.

    BitSet context = matches.stream()
        .collect(BitSet::new,
                 (bs,i) -> bs.set(Math.max(0, i - before), Math.min(i + after + 1, len)),
                 BitSet::or);

If we just want to print out all the lines, including context, we can do this:

    context.stream()
           .forEachOrdered(i -> System.out.println(input.get(i)));

The actual grep -A a -B b command prints a separator between each group of context lines. To figure out when to print a separator, we look at each 1-bit in the context bit set. If there's a 0-bit preceding it, or if it's at the very beginning, we set a bit in the result. This gives us a 1-bit at the beginning of each group of context lines:

    BitSet separators = context.stream()
                               .filter(i -> i == 0 || !context.get(i-1))
                               .collect(BitSet::new, BitSet::set, BitSet::or);

We don't want to print the separator before each group of context lines; we want to print it between each group. That means we have to clear the first 1-bit (if any):

    // clear the first bit
    int first = separators.nextSetBit(0);
    if (first >= 0) {
        separators.clear(first);
    }

Now, we can print out the result lines. But before printing each line, we check to see if we should print a separator first:

    context.stream()
           .forEachOrdered(i -> {
               if (separators.get(i)) {
                   System.out.println("--");
               }
               System.out.println(input.get(i));
           });
}
like image 42
Stuart Marks Avatar answered Nov 04 '22 13:11

Stuart Marks