
To remove garbage characters from a string using regex

Tags:

regex

I want to remove characters from a string other than a-z and A-Z. I created the following function for this, and it works fine.

public String stripGarbage(String s) {
    String good = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abcdefghijklmnopqrstuvwxyz";
    String result = "";
    for (int i = 0; i < s.length(); i++) {
        if (good.indexOf(s.charAt(i)) >= 0) {
            result += s.charAt(i);
        }
    }
    return result;
}

Can anyone suggest a better way to achieve the same result? A regex might be a better option.

Regards

Harry

Harjit Singh asked May 31 '10 11:05




2 Answers

Here you go:

result = s.replaceAll("[^a-zA-Z0-9]", "");

But if you understand your code and it's readable then maybe you have the best solution:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
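If you do go the regex route and the method is called often, one option is to compile the pattern once and reuse it, since String.replaceAll compiles its pattern on each call. A minimal sketch, assuming a helper class named RegexStripper (a hypothetical name, not from the question):

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the name is chosen for illustration only.
final class RegexStripper {
    // Compiled once and reused across calls.
    // The trailing '+' lets a whole run of unwanted characters be removed in one match.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]+");

    // Returns s with everything except ASCII letters and digits removed.
    static String stripGarbage(String s) {
        return NON_ALNUM.matcher(s).replaceAll("");
    }
}
```

For example, RegexStripper.stripGarbage("He!!o, W0rld") should give "HeoW0rld". Whether this is noticeably faster than the one-liner depends on how often it is called, so measure before relying on it.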

Robben_Ford_Fan_boy answered Sep 28 '22 17:09


The following should be faster than anything using regex, and faster than your initial attempt.

public String stripGarbage(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (int i = 0; i < s.length(); i++) {
        char ch = s.charAt(i);
        if ((ch >= 'A' && ch <= 'Z') || 
            (ch >= 'a' && ch <= 'z') ||
            (ch >= '0' && ch <= '9')) {
            sb.append(ch);
        }
    }
    return sb.toString();
}

Key points:

  • It is significantly faster to use a StringBuilder than string concatenation in a loop. (The latter generates N - 1 garbage strings and copies N * (N + 1) / 2 characters to build a String containing N characters.)

  • If you have a good estimate of the length of the result String, it is a good idea to preallocate the StringBuilder to hold that number of characters. (But if you don't have a good estimate, the cost of the internal reallocations etc amortizes to O(N) where N is the final string length ... so this is not normally a major concern.)

  • Testing a character against (up to) 3 character ranges will be significantly faster on average than searching for a character in a 62 character String.

  • A switch statement might be faster, especially if there are more character ranges. However, in this case it will take many more lines of code to list the cases for all of the letters and digits.

  • If the non-garbage characters match existing predicates of the Character class (e.g. Character.isLetter(char) etc) you could use those. This would be a good option if you wanted to match any letter or digit ... rather than just ASCII letters and digits.

  • Other alternatives to consider are using a HashSet<Character> or a boolean[] indexed by character, pre-populated with the non-garbage characters. These approaches work well if the set of non-garbage characters is not known at compile time.
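The last two points could be sketched roughly as follows. The class and method names here are hypothetical, and the table variant assumes the allowed characters arrive as a String at run time:

```java
// Hypothetical class and method names, chosen for illustration only.
final class GarbageStrippers {

    // Variant 1: keep any Unicode letter or digit, not just ASCII ones.
    // Works on individual chars, so characters outside the Basic Multilingual
    // Plane (stored as surrogate pairs) are dropped.
    static String stripNonLetterOrDigit(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (Character.isLetterOrDigit(ch)) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    // Variant 2, step 1: build a lookup table with one slot per possible char value.
    static boolean[] buildTable(String allowed) {
        boolean[] table = new boolean[Character.MAX_VALUE + 1];
        for (int i = 0; i < allowed.length(); i++) {
            table[allowed.charAt(i)] = true;
        }
        return table;
    }

    // Variant 2, step 2: keep only the characters marked in the table.
    static String stripWithTable(String s, boolean[] table) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (table[ch]) {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

The table costs 65,536 booleans up front, so it only makes sense if it is built once and reused for many strings.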

Stephen C answered Sep 28 '22 18:09