In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:
private static String processWord(String x) {
    String tmp;
    tmp = x.toLowerCase();
    tmp = tmp.replace(",", "");
    tmp = tmp.replace(".", "");
    tmp = tmp.replace(";", "");
    tmp = tmp.replace("!", "");
    tmp = tmp.replace("?", "");
    tmp = tmp.replace("(", "");
    tmp = tmp.replace(")", "");
    tmp = tmp.replace("{", "");
    tmp = tmp.replace("}", "");
    tmp = tmp.replace("[", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replace("<", "");
    tmp = tmp.replace(">", "");
    tmp = tmp.replace("%", "");
    return tmp;
}
Would it be faster if I used some sort of StringBuilder, or a regex, or maybe something else? Yes, I know: profile it and see, but I hope someone can provide an answer off the top of their head, as this is a common task.
The standard solution for removing punctuation from a String is the replaceAll() method. It replaces each substring that matches the given regular expression, so replacing with an empty string removes the matches. You can use the POSIX character class \p{Punct} to build a regular expression that finds punctuation characters.
Using str.replace(), we can replace a specific character. To remove that character, replace it with an empty string. The str.replace() method replaces all occurrences of the specified character.
trim() removes whitespace before the first non-whitespace character of a string (leading spaces) and after the last non-whitespace character (trailing spaces).
Although \\p{Punct} will specify a wider range of characters than in the question, it does allow for a shorter replacement expression:
tmp = tmp.replaceAll("\\p{Punct}+", "");
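To see what "a wider range" means in practice, here is a small sketch. The class name PunctDemo, the helper stripPunct and the sample input are made up for illustration. By default \p{Punct} matches the 32 ASCII punctuation characters, so it also strips apostrophes and hyphens, which the 14-character list in the question leaves alone.

```java
public class PunctDemo {
    // Removes every ASCII punctuation character, not just the 14 listed in the question.
    static String stripPunct(String s) {
        return s.replaceAll("\\p{Punct}+", "");
    }

    public static void main(String[] args) {
        // The apostrophe and the hyphen disappear too, unlike with the original code.
        System.out.println(stripPunct("don't stop-now (really)!")); // prints "dont stopnow really"
    }
}
```

If words such as "don't" need to survive intact, an explicit character list is probably the safer choice.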
Here's a late answer, just for fun.
In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:
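private static String processWord(String x) {
    // The brackets are escaped: an unescaped [ inside a Java character class opens a nested class.
    return x.replaceAll("[\\]\\[(){},.;!?<>%]", "");
}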
This is slow because every time you call this method, the regex is compiled again. So you can pre-compile the regex.
// Requires: import java.util.regex.Pattern;
private static final Pattern UNDESIRABLES = Pattern.compile("[\\]\\[(){},.;!?<>%]");
private static String processWord(String x) {
    return UNDESIRABLES.matcher(x).replaceAll("");
}
This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.
Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:
private static final boolean[] CHARS_TO_KEEP = new boolean[128]; // 128 is an assumed size covering ASCII; pick one that covers your input
Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
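For anyone who wants a starting point, here is one way the lookup-table idea could look. This is only a sketch under a few assumptions: the table has 128 entries (ASCII only), anything outside the table is kept, and the class name LookupTableDemo is made up. It iterates over chars rather than code points, which is safe here because every character being removed is ASCII, so surrogate pairs pass through untouched.

```java
import java.util.Arrays;

public class LookupTableDemo {
    // Assumed size: one flag per ASCII character; true means "keep this character".
    private static final boolean[] CHARS_TO_KEEP = new boolean[128];
    static {
        Arrays.fill(CHARS_TO_KEEP, true);
        // Mark the characters from the question as unwanted.
        for (char c : ",.;!?(){}[]<>%".toCharArray()) {
            CHARS_TO_KEEP[c] = false;
        }
    }

    static String processWord(String x) {
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            // Characters beyond the table (non-ASCII) are always kept.
            if (c >= CHARS_TO_KEEP.length || CHARS_TO_KEEP[c]) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("Hello, [world]!")); // prints "Hello world"
    }
}
```

Like the regex versions, this sketch does not lower-case the input; call x.toLowerCase() first if that step from the question is needed.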
Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that modern languages are JITted and after warming up they will perform better, so use a good profiler.
One thing that should be mentioned is that the example in the original question is likely to perform poorly because it can create a whole bunch of temporary strings, one for each replace() call! Unless the compiler or JIT optimizes all that away, that particular solution will probably perform the worst.
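To make the temporary-string point concrete, here is a minimal sketch of a single-pass alternative: it lower-cases once, then copies the wanted characters into one StringBuilder instead of producing an intermediate string per replace() call. The class name SinglePassDemo and the constant UNDESIRABLE_CHARS are illustrative, and whether this actually beats the chained calls on a given JVM would still have to be measured.

```java
public class SinglePassDemo {
    // The same 14 characters the question removes one by one.
    private static final String UNDESIRABLE_CHARS = ",.;!?(){}[]<>%";

    static String processWord(String x) {
        String lower = x.toLowerCase(); // the only intermediate string
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            // indexOf returns -1 when c is not one of the unwanted characters.
            if (UNDESIRABLE_CHARS.indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("What?! (50%) <ok>")); // prints "what 50 ok"
    }
}
```

The linear indexOf scan is fine for a list this short; for a much larger set of characters a table or a precompiled Pattern would likely scale better.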