Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Replacing lots of substrings in big strings

One of our module's performance relies highly on how we replace substrings in string.

We form a "replacement map" which can contain more than 3500 string pairs and then we apply it with StringUtils.replaceEach(text, searchList, replacementList) to big strings (several MBs).

The keys and values are all unique and in most cases have the same character length (but it's not something we can rely on).

Is there a more sophisticated approach to my task than StringUtils.replaceEach()? Something which may be an overkill for simple replacements solved by replaceEach(), but which is much faster in my "heavy" case.

like image 340
okutane Avatar asked Nov 02 '16 18:11

okutane


People also ask

How do I replace multiple characters in a string?

Use the replace() method to replace multiple characters in a string, e.g. str. replace(/[. _-]/g, ' ') . The first parameter the method takes is a regular expression that can match multiple characters.

How do you replace all occurrences in a string?

To replace all occurrences of a substring in a string by a new one, you can use the replace() or replaceAll() method: replace() : turn the substring into a regular expression and use the g flag. replaceAll() method is more straight forward.


1 Answers

You can use a regexp engine, to effectively match your keys against the input string, and replace them.

First, concatenate all your keys with an alternation operator, like that:

var keys = "keyA|keyB|keyC";

Next, compile a pattern:

Pattern pattern = Pattern.compile("(" + keys + ")")

Create a matcher against your input text:

Matcher matcher= pattern.matcher(text);

Now, apply your regexp in a loop, to find all the keys in your text , and use appendReplacement (which is an "inline" string replacement method), to replace all of them with the corresponding value:

StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb,dictionary.get(matcher.group(0)));
}
matcher.appendTail(sb);

And here you go.

Note that this might look a bit naive at first, but indeed regexp engine is heavily optimized for the task at hand, and as Java regexp implementation also allows for an "inline" replacement, it all works very well.

I did a small benchmark, by applying a list of color names (~200 different color names), as defined in /usr/share/X11/rgb.txt against the "Crime and Punishment" by Fyodor Dostoyevsky, I downloaded from Project Gutenberg (~1MB in size), using the technique described, and it worked around

x12 times faster than StringUtils.replaceEach - 900ms vs 10700ms

for the latter (not counting for the Pattern compilation time).

P.S. if your keys can potentially contain characters, unsafe for regexp , like .^$(), you should use Pattern.quote() before adding them to your pattern.

Sidenote:

This method will replace keys, in the order, they appear in the pattern list, e.g. "a=>1|b=>2|aa=>3" when applied to "welcome to bazaar" will result in "welcome to b1z11r", and not "welcome to b1z3r", if you want the longest match, you should sort your keys lexicographically before adding them to the pattern (i.e. "b|aa|a"). It also applies to your original StringUtils.replaceEach() method.

Update:

The method above should work nice for the problem, as formulated in the original question, i.e. when the size of the replacement map is (relatively) small compared to the input data size.

If, instead you have a very long dictionary, applied to a short text, the linear search as done by StringUtils.replaceEach() can be faster than it.

I've made an additional benchmark illustrating that, by applying a dictionary of 10000 randomly chosen words (+4 characters long):

cat /usr/share/dict/words | grep -E "^.{4,}$" | shuf | head -10000

against the: 1024,2048,4096,8192,16384,32768,65536,131072,262144 and 524288 characters long excerpts from the very same "Crime and Punishment" text.

The results are given below:

text    Ta(ms)  Tb(ms)  Ta/Tb(speed up)
---------------------------------------
1024    99      240     0.4125
2048    43      294     0.1462585
4096    113     721     0.1567267
8192    128     1329    0.0963130
16384   320     2230    0.1434977
32768   2052    3708    0.5533981
65536   6811    6650    1.0242106
131072  32422   12663   2.5603728
262144  150655  23011   6.5470862
524288  614634  29874   20.574211
  • Ta - StringUtils.replaceEach() time
  • Tb - matcher.appendReplacement() time

Note the pattern string length is 135537 bytes (all 10000 keys concatenated)

like image 181
zeppelin Avatar answered Oct 10 '22 10:10

zeppelin