
Efficiently removing specific characters (some punctuation) from Strings in Java?

Tags: java, string, regex

In Java, what is the most efficient way of removing given characters from a String? Currently, I have this code:

private static String processWord(String x) {
    String tmp;

    tmp = x.toLowerCase();
    tmp = tmp.replace(",", "");
    tmp = tmp.replace(".", "");
    tmp = tmp.replace(";", "");
    tmp = tmp.replace("!", "");
    tmp = tmp.replace("?", "");
    tmp = tmp.replace("(", "");
    tmp = tmp.replace(")", "");
    tmp = tmp.replace("{", "");
    tmp = tmp.replace("}", "");
    tmp = tmp.replace("[", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replace("<", "");
    tmp = tmp.replace(">", "");
    tmp = tmp.replace("%", "");

    return tmp;
}

Would it be faster if I used some sort of StringBuilder, or a regex, or maybe something else? Yes, I know: profile it and see, but I hope someone can provide an answer off the top of their head, as this is a common task.

VPeric asked Jul 08 '13 16:07




2 Answers

Although \\p{Punct} will specify a wider range of characters than in the question, it does allow for a shorter replacement expression:

tmp = tmp.replaceAll("\\p{Punct}+", "");
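To make the "wider range" concrete, here is a minimal sketch (the class and method names are hypothetical, not from the answer). Without the UNICODE_CHARACTER_CLASS flag, \p{Punct} matches the 32 ASCII punctuation characters, so apostrophes and hyphens are stripped as well, which the question's code would have kept.

```java
import java.util.regex.Pattern;

final class PunctDemo {
    // Compiled once and reused; matches runs of ASCII punctuation.
    private static final Pattern PUNCT = Pattern.compile("\\p{Punct}+");

    static String stripPunct(String s) {
        return PUNCT.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripPunct("Hello, world!")); // Hello world
        // The apostrophe and hyphen go too, unlike in the question's code.
        System.out.println(stripPunct("don't e-mail"));  // dont email
    }
}
```

If words like "don't" need to survive intact, an explicit character class listing only the unwanted characters is probably the safer choice.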
Reimeus answered Oct 13 '22 21:10


Here's a late answer, just for fun.

In cases like this, I would suggest aiming for readability over speed. Of course you can be super-readable but too slow, as in this super-concise version:

private static String processWord(String x) {
    // Both brackets are escaped: Java reads an unescaped [ inside a class as a nested class.
    return x.replaceAll("[\\]\\[(){},.;!?<>%]", "");
}

This is slow because the regex is compiled every time you call this method. So you can pre-compile the regex.

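import java.util.regex.Pattern;

// Compiled once. Both brackets are escaped: Java reads an unescaped [ inside a class as a nested class.
private static final Pattern UNDESIRABLES = Pattern.compile("[\\]\\[(){},.;!?<>%]");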

private static String processWord(String x) {
    return UNDESIRABLES.matcher(x).replaceAll("");
}

This should be fast enough for most purposes, assuming the JVM's regex engine optimizes the character class lookup. This is the solution I would use, personally.
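As a sanity check, the sketch below compares the precompiled pattern with the chained replace() calls from the question. The class name, the processWordChained helper and the sample string are made up for illustration, and toLowerCase() is added back in so both versions do the same work. The brackets in the pattern are escaped, since Java treats an unescaped [ inside a character class as the start of a nested class.

```java
import java.util.regex.Pattern;

final class UndesirablesCheck {
    private static final Pattern UNDESIRABLES =
            Pattern.compile("[\\]\\[(){},.;!?<>%]");

    // Precompiled-pattern version, with the question's lowercasing step.
    static String processWord(String x) {
        return UNDESIRABLES.matcher(x.toLowerCase()).replaceAll("");
    }

    // The question's approach, condensed into a loop over the same 14 literals.
    static String processWordChained(String x) {
        String tmp = x.toLowerCase();
        String[] unwanted = {",", ".", ";", "!", "?", "(", ")",
                             "{", "}", "[", "]", "<", ">", "%"};
        for (String s : unwanted) {
            tmp = tmp.replace(s, "");
        }
        return tmp;
    }

    public static void main(String[] args) {
        String sample = "Hello, [World]! <50%> {ok} (yes); no?";
        System.out.println(processWord(sample)); // hello world 50 ok yes no
        System.out.println(processWord(sample).equals(processWordChained(sample))); // true
    }
}
```

This only shows that the two versions agree on one input; it says nothing about their relative speed, which still needs a profiler.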

Now without profiling, I wouldn't know whether you could do better by making your own character (actually codepoint) lookup table:

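// The size is an assumption: one slot per UTF-16 char value (a full code point table would need 0x110000 slots).
private static final boolean[] CHARS_TO_KEEP = new boolean[Character.MAX_VALUE + 1];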

Fill this once and then iterate, making your resulting string. I'll leave the code to you. :)
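For what it's worth, one way the fill-and-iterate step might look is sketched below. It is not the answerer's code: the class name and the table size are assumptions, and it indexes by UTF-16 char rather than by code point, which is enough here because every unwanted character is ASCII.

```java
import java.util.Arrays;

final class LookupTableDemo {
    // The 14 characters removed in the question.
    private static final String UNDESIRABLES = ",.;!?(){}[]<>%";

    // Assumed size: one slot per UTF-16 char value; true means "keep".
    private static final boolean[] CHARS_TO_KEEP =
            new boolean[Character.MAX_VALUE + 1];

    static {
        // Keep everything by default, then mark the unwanted characters.
        Arrays.fill(CHARS_TO_KEEP, true);
        for (char c : UNDESIRABLES.toCharArray()) {
            CHARS_TO_KEEP[c] = false;
        }
    }

    static String processWord(String x) {
        // Single pass over the input, appending only the chars to keep.
        StringBuilder sb = new StringBuilder(x.length());
        for (int i = 0; i < x.length(); i++) {
            char c = x.charAt(i);
            if (CHARS_TO_KEEP[c]) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(processWord("a[b]c, (d)!")); // abc d
    }
}
```

Surrogate pairs pass through untouched because both halves are marked as kept. Whether this beats the precompiled regex is something only profiling can tell.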

Again, I wouldn't dive into this kind of optimization. The code has become too hard to read. Is performance that much of a concern? Also remember that modern languages are JITted and after warming up they will perform better, so use a good profiler.

One thing that should be mentioned is that the example in the original question is likely to perform poorly, because each of the 14 replace() calls scans the whole string again and can allocate a new temporary string. Unless the compiler optimizes all that away, that particular solution will probably perform the worst.

Ray Toal answered Oct 13 '22 20:10