
How to define a regex to remove text-masked spam links ("spam1 dot com") from a Java String?

Tags: java, regex, spam

I have a list of sites that represent spam links:

List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

Is there a regex way of removing links matching these banned sites from this text:

Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com 
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer

Note that the link may appear in several URL formats, which aioobe's solution does a good job of identifying:

    String input = "Dear Arezzo,\n"
        + "Please check out my website at spam1.com or http://www.spam1.com\n"
        + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
        + "Thank you.";

    List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");

    StringBuilder re = new StringBuilder();
    for (String bannedSite : bannedSites) {
        if (re.length() > 0)
            re.append("|");
        re.append(String.format("http://(www\\.)?%s\\S*|%1$s",
                                Pattern.quote(bannedSite)));
    }

    System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));

But while the code above works great for the URL formats spam1.com, http://www.spam1.com and http://spam1.com, it misses the text-masked formats. How can I modify the regex to target formats such as these?

spam1 dot com
spam1[.com]
spam1 .com
spam1 . com

The idea is to produce a result like this:

Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED] 
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer

As I remarked in the comments below, I probably don't need to ban the whole string spam1 dot com. If I can efface just the spam1 part so that it becomes: [LINK REMOVED] dot com - that would do the job.

asked Nov 05 '22 13:11 by arezzo


1 Answer

Here's a start for you.

import java.util.*;
import java.util.regex.Pattern;

class Test {
    public static void main(String[] args) {

        String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com "
            + "or http://www.spam1.com or http://spam1.com or " 
            + "spam1 dot com to win millions of dollars in prizes.\n"
            + "Thank you.";

        List<String> bannedSites = Arrays.asList("spam1", "spam2", "spam3");

        StringBuilder re = new StringBuilder();
        for (String bannedSite : bannedSites) {
            if (re.length() > 0)
                re.append("|");
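            // Quote the name so regex metacharacters in it are matched literally.
            String quotedSite = Pattern.quote(bannedSite);
            // Full URLs: http:// or https://, optional "www.", the name, then the rest of the token.
            re.append("https?://(www\\.)?" + quotedSite + "\\S*");
            // Bare and masked forms: the name, an optional "dot" or ".", then a known TLD.
            re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
            //re.append("|" ... your variation here);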
        }

        System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
    }
}

Output:

Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.

Extend the regular expression as needed.
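For instance, here is one possible extension, offered as a rough sketch rather than a definitive filter. It also covers the bracketed and spaced variants from the question (spam1[.com], spam1 .com, spam1 . com). The class name SpamLinkFilter and the methods buildPattern and scrub are hypothetical names made up for this illustration. The TLD list is the same com/net/org assumption as above, and the optional square brackets are a guess at how that masking style is usually written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of any library.
class SpamLinkFilter {

    // Assumed TLD list, taken from the (com|net|org) group used above.
    static final List<String> TLDS = Arrays.asList("com", "net", "org");

    // Builds a single pattern covering every banned site name.
    static Pattern buildPattern(List<String> bannedNames) {
        List<String> quoted = new ArrayList<>();
        for (String name : bannedNames) {
            // Quote each name so regex metacharacters in it are literal.
            quoted.add(Pattern.quote(name));
        }
        String names = "(?:" + String.join("|", quoted) + ")";
        String tld = "(?:" + String.join("|", TLDS) + ")";

        // Full URLs: scheme, optional "www.", banned name, rest of the token.
        String url = "https?://(?:www\\.)?" + names + "\\S*";

        // Bare or masked forms: name, optional "[", optional "dot" or ".",
        // a TLD ending at a word boundary, then an optional "]".
        String masked = names + "\\s*\\[?\\s*(?:dot|\\.)?\\s*" + tld + "\\b\\]?";

        return Pattern.compile(url + "|" + masked);
    }

    // Replaces every match with the marker text requested in the question.
    static String scrub(String text, List<String> bannedNames) {
        return buildPattern(bannedNames).matcher(text).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        String input = "Dear Arezzo,\n"
            + "Please check out my website at spam1.com or http://www.spam1.com\n"
            + "or http://spam1.com or spam1 dot com or spam1[.com] to win prizes.\n"
            + "Thank you.";
        System.out.println(scrub(input, Arrays.asList("spam1", "spam2", "spam3")));
    }
}
```

The separator stays optional here, as in the code above, so "spam1 com" is also replaced. The \b after the TLD is there so that a longer word such as "spam1.community" is left alone. Tighten or loosen either choice to suit your data.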

answered Nov 09 '22 10:11 by aioobe