Java: regex replacement in large files [duplicate]

Tags: java, regex

The java.util.regex.Matcher methods replaceFirst(...) and replaceAll(...) return Strings, which (with the default heap size) may well cause an OOME for inputs as large as 20-50M characters. These two methods can easily be rewritten to write to Writers rather than construct strings, effectively eliminating one point of failure.
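
A minimal sketch of such a rewrite, assuming Java 9+ (for the StringBuilder overloads of appendReplacement and appendTail). The class and method names are hypothetical, and the sketch only addresses the output side: the input still has to be a CharSequence.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WriterReplace {
    /** Behaves like Matcher.replaceAll, but streams the result to a Writer. */
    static void replaceAll(Pattern pattern, CharSequence input,
                           String replacement, Writer out) throws IOException {
        Matcher m = pattern.matcher(input);
        StringBuilder chunk = new StringBuilder();
        while (m.find()) {
            // Appends the text since the previous match plus the expanded
            // replacement ($1, $2, ... are resolved by the Matcher).
            m.appendReplacement(chunk, replacement);
            out.append(chunk);
            chunk.setLength(0); // reuse the small buffer for the next match
        }
        // Appends whatever follows the last match.
        m.appendTail(chunk);
        out.append(chunk);
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        replaceAll(Pattern.compile("(\\w+)@example"),
                   "bob@example, amy@example", "$1 at example", out);
        System.out.println(out); // prints "bob at example, amy at example"
    }
}
```

The buffer only ever holds the text between two consecutive matches (or the tail after the last one), so its size depends on how densely the pattern matches rather than on the total output size.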

The Matcher's factory method, however, only accepts CharSequences, so I am still likely to hit an OOME if I load the input into a String/StringBuffer/StringBuilder.

How do I wrap a java.io.Reader so that it implements the CharSequence interface (given that my regexps may contain backreferences)? Is there any other solution that can perform regex replacement in files and is not OOME-prone on large inputs?

In other words, how do I implement functionality similar to that of GNU sed in Java (as sed is known to tackle files as large as a couple of terabytes, while featuring the same support for extended regular expressions)?

Bass asked Jun 10 '15 10:06



1 Answer

Since what you actually need is sed's behaviour, you can run sed itself from Java, for example:

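// Run sed through bash so that the output redirection (>) is handled by the shell.
String[] cmdArray = {"bash", "-c", "sed 's/YourRegex/YourReplaceStr/' inputfile > output"};
Process runCmd = Runtime.getRuntime().exec(cmdArray);
int exitCode = runCmd.waitFor(); // wait for sed to finish before using the output file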
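
As a variation on the same idea, here is a hedged sketch that uses ProcessBuilder to redirect the files itself, so no shell (and no shell quoting) is involved. It assumes a sed executable is on the PATH and that neither the regex nor the replacement contains an unescaped '/'. The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class SedRunner {
    /** Builds the argument vector for sed; each element is passed as-is. */
    static List<String> sedCommand(String regex, String replacement) {
        // Assumption: '/' does not occur unescaped in either string.
        return Arrays.asList("sed", "s/" + regex + "/" + replacement + "/");
    }

    /** Runs sed with stdin/stdout bound to files and returns its exit code. */
    static int run(String regex, String replacement, File in, File out)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(sedCommand(regex, replacement));
        pb.redirectInput(in);    // sed reads the input file on stdin
        pb.redirectOutput(out);  // sed's stdout goes straight to the output file
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // show sed's errors
        return pb.start().waitFor();
    }
}
```

A call such as SedRunner.run("YourRegex", "YourReplaceStr", new File("inputfile"), new File("output")) should then correspond to the bash command above. The data flows between the files and sed directly, so it never passes through the Java heap.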

I used a bash example, but if you want to run it on Windows you can install sed through Cygwin and run the same command, or install the native Windows build of sed, which you can download from here:

http://gnuwin32.sourceforge.net/packages/sed.htm

For Windows you could use:

String[] cmdArray = {"call", "sed 's/YourRegex/YourReplaceStr/' inputfile > output"};
Process runCmd = Runtime.getRuntime().exec(cmdArray);

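I don't have Windows, so I cannot test the command above. You may have to remove call or replace it with just sed, since call is a built-in of cmd.exe rather than a standalone program. Note also that cmd.exe does not treat single quotes as quoting characters, so the sed script may need double quotes instead. Another alternative you can try is: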

String[] cmdArray = {"cmd", "/c", "sed 's/YourRegex/YourReplaceStr/' inputfile > output"};
Process runCmd = Runtime.getRuntime().exec(cmdArray);

In this link you can find a dir example executed from Java, which you can adapt to use sed.
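
If spawning an external process is not an option, sed's line-at-a-time model can be approximated in plain Java. The following is only a sketch under stated assumptions: no match spans a line break, every individual line fits in memory, and it is acceptable for line endings to be normalised to "\n". The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.regex.Pattern;

public class LineReplace {
    /** Applies the replacement to each line in turn, as sed does by default. */
    static void replaceLines(Reader in, Writer out,
                             Pattern pattern, String replacement) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            // Only the current line is held in memory; backreferences work
            // as usual because each line is an ordinary String.
            out.write(pattern.matcher(line).replaceAll(replacement));
            out.write('\n'); // readLine() strips the terminator, so add one back
        }
        out.flush();
    }
}
```

Passing a FileReader and a BufferedWriter over a FileWriter gives a file-to-file replacement. Using replaceFirst instead of replaceAll would mirror a sed s command without the g flag.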

Federico Piazza answered Nov 15 '22 16:11
