One of our module's performance relies highly on how we replace substrings in string.
We form a "replacement map" which can contain more than 3500 string pairs and then we apply it with StringUtils.replaceEach(text, searchList, replacementList)
to big strings (several MBs).
The keys and values are all unique and in most cases have the same character length (but it's not something we can rely on).
Is there a more sophisticated approach to my task than StringUtils.replaceEach()
? Something which may be an overkill for simple replacements solved by replaceEach()
, but which is much faster in my "heavy" case.
Use the replace() method to replace multiple characters in a string, e.g. str. replace(/[. _-]/g, ' ') . The first parameter the method takes is a regular expression that can match multiple characters.
To replace all occurrences of a substring in a string by a new one, you can use the replace() or replaceAll() method: replace() : turn the substring into a regular expression and use the g flag. replaceAll() method is more straight forward.
You can use a regexp engine, to effectively match your keys against the input string, and replace them.
First, concatenate all your keys with an alternation operator, like that:
var keys = "keyA|keyB|keyC";
Next, compile a pattern:
Pattern pattern = Pattern.compile("(" + keys + ")")
Create a matcher against your input text:
Matcher matcher= pattern.matcher(text);
Now, apply your regexp in a loop, to find all the keys in your text , and use appendReplacement (which is an "inline" string replacement method), to replace all of them with the corresponding value:
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb,dictionary.get(matcher.group(0)));
}
matcher.appendTail(sb);
And here you go.
Note that this might look a bit naive at first, but indeed regexp engine is heavily optimized for the task at hand, and as Java regexp implementation also allows for an "inline" replacement, it all works very well.
I did a small benchmark, by applying a list of color names (~200 different color names), as defined in /usr/share/X11/rgb.txt against the "Crime and Punishment" by Fyodor Dostoyevsky, I downloaded from Project Gutenberg (~1MB in size), using the technique described, and it worked around
x12 times faster than StringUtils.replaceEach - 900ms vs 10700ms
for the latter (not counting for the Pattern compilation time).
P.S. if your keys can potentially contain characters, unsafe for regexp , like .^$(), you should use Pattern.quote() before adding them to your pattern.
Sidenote:
This method will replace keys, in the order, they appear in the pattern list, e.g. "a=>1|b=>2|aa=>3" when applied to "welcome to bazaar" will result in "welcome to b1z11r", and not "welcome to b1z3r", if you want the longest match, you should sort your keys lexicographically before adding them to the pattern (i.e. "b|aa|a"). It also applies to your original StringUtils.replaceEach() method.
Update:
The method above should work nice for the problem, as formulated in the original question, i.e. when the size of the replacement map is (relatively) small compared to the input data size.
If, instead you have a very long dictionary, applied to a short text, the linear search as done by StringUtils.replaceEach() can be faster than it.
I've made an additional benchmark illustrating that, by applying a dictionary of 10000 randomly chosen words (+4 characters long):
cat /usr/share/dict/words | grep -E "^.{4,}$" | shuf | head -10000
against the: 1024,2048,4096,8192,16384,32768,65536,131072,262144 and 524288 characters long excerpts from the very same "Crime and Punishment" text.
The results are given below:
text Ta(ms) Tb(ms) Ta/Tb(speed up)
---------------------------------------
1024 99 240 0.4125
2048 43 294 0.1462585
4096 113 721 0.1567267
8192 128 1329 0.0963130
16384 320 2230 0.1434977
32768 2052 3708 0.5533981
65536 6811 6650 1.0242106
131072 32422 12663 2.5603728
262144 150655 23011 6.5470862
524288 614634 29874 20.574211
Note the pattern string length is 135537 bytes (all 10000 keys concatenated)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With