I have a text files that have a lot of string lines in there. If I want to find lines before and after a matching in grep, I will do like this:
grep -A 10 -B 10 "ABC" myfile.txt
How can I implements the equivalent in Java 8 using stream?
If you're willing to use a third party library and don't need parallelism, then jOOλ offers SQL-style window functions as follows
Seq.seq(Files.readAllLines(Paths.get(new File("/path/to/Example.java").toURI())))
.window(-1, 1)
.filter(w -> w.value().contains("ABC"))
.forEach(w -> {
System.out.println("-1:" + w.lag().orElse(""));
System.out.println(" 0:" + w.value());
System.out.println("+1:" + w.lead().orElse(""));
// ABC: Just checking
});
Yielding
-1: .window(-1, 1)
0: .filter(w -> w.value().contains("ABC"))
+1: .forEach(w -> {
-1: System.out.println("+1:" + w.lead().orElse(""));
0: // ABC: Just checking
+1: });
The lead()
function accesses the next value in traversal order from the window, the lag()
function accesses the previous row.
Disclaimer: I work for the company behind jOOλ
Such scenario is not well-supported by Stream API as existing methods do not provide an access to the element neighbors in the stream. The closest solution which I can think up without creating custom iterators/spliterators and third-party library calls is to read the input file into List
and then use indices Stream:
List<String> input = Files.readAllLines(Paths.get(fileName));
Predicate<String> pred = str -> str.contains("ABC");
int contextLength = 10;
IntStream.range(0, input.size()) // line numbers
// filter them leaving only numbers of lines satisfying the predicate
.filter(idx -> pred.test(input.get(idx)))
// add nearby numbers
.flatMap(idx -> IntStream.rangeClosed(idx-contextLength, idx+contextLength))
// remove numbers which are out of the input range
.filter(idx -> idx >= 0 && idx < input.size())
// sort numbers and remove duplicates
.distinct().sorted()
// map to the lines themselves
.mapToObj(input::get)
// output
.forEachOrdered(System.out::println);
The grep output also includes special delimiter like "--"
to designate the omitted lines. If you want to go further and mimic such behavior as well, I can suggest you to try my free StreamEx library as it has intervalMap
method which is helpful in this case:
// Same as IntStream.range(...).filter(...) steps above
IntStreamEx.ofIndices(input, pred)
// same as above
.flatMap(idx -> IntStream.rangeClosed(idx-contextLength, idx+contextLength))
// remove numbers which are out of the input range
.atLeast(0).less(input.size())
// sort numbers and remove duplicates
.distinct().sorted()
.boxed()
// merge adjacent numbers into single interval and map them to subList
.intervalMap((i, j) -> (j - i) == 1, (i, j) -> input.subList(i, j + 1))
// flatten all subLists prepending them with "--"
.flatMap(list -> StreamEx.of(list).prepend("--"))
// skipping first "--"
.skip(1)
.forEachOrdered(System.out::println);
As Tagir Valeev noted, this kind of problem isn't well supported by the streams API. If you incrementally want to read lines from the input and print out matching lines with context, you'd have to introduce a stateful pipeline stage (or a custom collector or spliterator) which adds quite a bit of complexity.
If you're willing to read all the lines into memory, it turns out that BitSet
is a useful representation for manipulating groups of matches. This bears some similarity to Tagir's solution, but instead of using integer ranges to represent lines to be printed, it uses 1-bits in a BitSet
. Some advantages of BitSet
are that it has a number of built-in bulk operations, and it has a compact internal representation. It can also produce a stream of indexes of the 1-bits, which is quite useful for this problem.
First, let's start out by creating a BitSet
that has a 1-bit for each line that matches the predicate:
void contextMatch(Predicate<String> pred, int before, int after, List<String> input) {
int len = input.size();
BitSet matches = IntStream.range(0, len)
.filter(i -> pred.test(input.get(i)))
.collect(BitSet::new, BitSet::set, BitSet::or);
Now that we have the bit set of matching lines, we stream out the indexes of each 1-bit. We then set the bits in the bitset that represent the before and after context. This gives us a single BitSet
whose 1-bits represent all of the lines to be printed, including context lines.
BitSet context = matches.stream()
.collect(BitSet::new,
(bs,i) -> bs.set(Math.max(0, i - before), Math.min(i + after + 1, len)),
BitSet::or);
If we just want to print out all the lines, including context, we can do this:
context.stream()
.forEachOrdered(i -> System.out.println(input.get(i)));
The actual grep -A a -B b
command prints a separator between each group of context lines. To figure out when to print a separator, we look at each 1-bit in the context bit set. If there's a 0-bit preceding it, or if it's at the very beginning, we set a bit in the result. This gives us a 1-bit at the beginning of each group of context lines:
BitSet separators = context.stream()
.filter(i -> i == 0 || !context.get(i-1))
.collect(BitSet::new, BitSet::set, BitSet::or);
We don't want to print the separator before each group of context lines; we want to print it between each group. That means we have to clear the first 1-bit (if any):
// clear the first bit
int first = separators.nextSetBit(0);
if (first >= 0) {
separators.clear(first);
}
Now, we can print out the result lines. But before printing each line, we check to see if we should print a separator first:
context.stream()
.forEachOrdered(i -> {
if (separators.get(i)) {
System.out.println("--");
}
System.out.println(input.get(i));
});
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With