Java PatternSyntaxException: Unmatched closing ')'

Tags:

java

I need to remove all URLs from Twitter messages. I have a file with around 200,000 such messages, so speed is crucial. I am using Java; here is an example of my code:

public String performStrip(){

    String tweet = this.getRawTweet();
    String urlPattern = "((https?|http)://(bit\\.ly|t\\.co|lnkd\\.in|tcrn\\.ch)\\S*)\\b";

    Pattern p = Pattern.compile(urlPattern,Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(tweet);

    int i = 0;

    while (m.find()) {
        tweet = tweet.replaceAll(m.group(i),"").trim();
        i++;
    }

    return tweet;
}

This works fine in the following cases:

http://t.co/nhWp9hldEH        -> (empty string)
http://t.co/nhWp9hldEH"       -> "
http://t.co/nhWp9hldEH)aaa"   -> aaa"
aaa(http://t.co/nhWp9hldEH"   -> aaa("
aaa(http://t.co/nhWp9hldEH)"  -> aaa()"

However, with the following input:

http://t.co/nhWp9hldEH)aaa"

I get this error:

java.util.regex.PatternSyntaxException: Unmatched closing ')' near index 21

http://t.co/nhWp9hldEH)aa

at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.compile(Pattern.java:1669)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.lang.String.replaceAll(String.java:2210)
at com.anturo.preprocess.url.UrlStripper.performStrip(UrlStripper.java:47)
at com.anturo.preprocess.testing.ReadIn.<init>(ReadIn.java:35)
at com.anturo.preprocess.testing.Main.main(Main.java:6)

I have already looked at several similar questions about this error, but none of the suggested fixes have worked so far. I hope someone can help me out here.

RazorAlliance192 asked May 24 '26 19:05


1 Answer

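The problem is that String.replaceAll() compiles its first argument as a regex, and the text you matched can contain regex special characters. Here the matched URL contains an unbalanced ')', which is why Pattern.compile() fails inside replaceAll(), as the stack trace shows.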

Short solution: use Pattern.quote(). Your code would then be:

tweet = tweet.replaceAll(Pattern.quote(m.group(i)),"").trim();

Note: Pattern.quote() is only available since JDK 1.5, but you are using that or later, right?
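
To make the effect of Pattern.quote() concrete, here is a minimal sketch. The class name QuoteDemo and the helper removeLiteral are made up for illustration; the sample text is the string shown in the error message from the question.

```java
import java.util.regex.Pattern;

class QuoteDemo {
    // Hypothetical helper: removes every occurrence of `literal` from `text`,
    // treating `literal` as plain text even though replaceAll() expects a regex.
    static String removeLiteral(String text, String literal) {
        return text.replaceAll(Pattern.quote(literal), "");
    }

    public static void main(String[] args) {
        String matched = "http://t.co/nhWp9hldEH)aaa";

        // quote() wraps the text in \Q...\E so that ')' loses its special meaning.
        System.out.println(Pattern.quote(matched)); // \Qhttp://t.co/nhWp9hldEH)aaa\E

        // No PatternSyntaxException here, because the ')' is quoted.
        System.out.println(removeLiteral("see http://t.co/nhWp9hldEH)aaa\" now", matched)); // see " now
    }
}
```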

Another solution is to simply use .replace():

tweet = tweet.replace(m.group(i), "").trim();

Despite what its name suggests next to .replaceAll(), .replace() does replace all occurrences; the difference is that it treats its search argument as a literal string, not as a regex. See also .replaceFirst(), which takes a regex but replaces only the first match.
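
The difference between the three methods is easy to check on a small string. This is only an illustrative sketch; the class name ReplaceDemo and the sample strings are made up.

```java
class ReplaceDemo {
    public static void main(String[] args) {
        String s = "a.b.c";

        // replace(): the target is a literal string; every occurrence is replaced.
        System.out.println(s.replace(".", "-"));      // a-b-c

        // replaceAll(): the target is a regex; "." matches any character.
        System.out.println(s.replaceAll(".", "-"));   // -----

        // replaceFirst(): the target is a regex; only the first match is replaced.
        System.out.println(s.replaceFirst(".", "-")); // -.b.c

        // A stray ')' in the target is harmless for replace(), unlike for replaceAll().
        System.out.println("x http://t.co/a)b y".replace("http://t.co/a)b", "")); // x  y
    }
}
```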

Last but not least, you seem to be misusing .group()! Your loop should be:

while (m.find())
    tweet = tweet.replace(m.group(), "").trim();

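No need for the i variable here; m.group() with no argument returns the whole current match, whereas m.group(i) returns, for that one match, what was matched by capturing group i in your regex. Your pattern has three capturing groups, so incrementing i on every match would also throw an IndexOutOfBoundsException as soon as i reaches 4.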
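
Putting the pieces together, a complete version could look like the sketch below. It is not the asker's actual class: the name UrlStripSketch and both method names are hypothetical, and the pattern is copied unchanged from the question. The first variant is the loop above; the second is an alternative that was not part of the original suggestion, letting the Matcher remove the matches itself so the matched text is never searched for again. Compiling the pattern once into a constant is an assumption about how the method is called for each of the 200,000 messages.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlStripSketch {
    // Pattern from the question, compiled once and reused for every tweet.
    private static final Pattern URL = Pattern.compile(
            "((https?|http)://(bit\\.ly|t\\.co|lnkd\\.in|tcrn\\.ch)\\S*)\\b",
            Pattern.CASE_INSENSITIVE);

    // Variant 1: the corrected loop; group() is the whole match,
    // and replace() treats it as a literal string.
    static String stripWithLoop(String tweet) {
        Matcher m = URL.matcher(tweet);
        String result = tweet;
        while (m.find()) {
            result = result.replace(m.group(), "");
        }
        return result.trim();
    }

    // Variant 2 (alternative): Matcher.replaceAll() replaces each match
    // by position, so no second search through the text is needed.
    static String stripWithMatcher(String tweet) {
        return URL.matcher(tweet).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String failing = "http://t.co/nhWp9hldEH)aaa\"";
        System.out.println(stripWithLoop(failing));    // "
        System.out.println(stripWithMatcher(failing)); // "
    }
}
```

Note that the pattern itself decides where a URL ends: for the failing input, the match runs up to the last word boundary and therefore includes the trailing )aaa.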

fge answered May 26 '26 07:05


