I have a list of sites that represent spam links:
List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");
Is there a regex-based way to remove links matching these banned sites from a text such as this one?
Dear Arezzo,
Please check out my website at spam1.com or http://www.spam1.com
or http://spam1.com or spam1 dot com to win millions of dollars in prizes.
Thank you.
Big Spammer
Note that the link may appear in several URL formats, which aioobe's solution does a good job of identifying:
String input = "Dear Arezzo,\n"
        + "Please check out my website at spam1.com or http://www.spam1.com\n"
        + "or http://spam1.com or spam1 dot com to win millions of dollars in prizes.\n"
        + "Thank you.";
List<String> bannedSites = Arrays.asList("spam1.com", "spam2.com", "spam3.com");
// Build one alternation covering every banned site
StringBuilder re = new StringBuilder();
for (String bannedSite : bannedSites) {
    if (re.length() > 0)
        re.append("|");
    // Matches http://[www.]site... as well as the bare site name
    re.append(String.format("http://(www\\.)?%s\\S*|%1$s",
            Pattern.quote(bannedSite)));
}
System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
But while the code above works great for the URL formats spam1.com, http://www.spam1.com and http://spam1.com, it misses the multiple text formats:
spam1 dot com
spam1[.com]
spam1 .com
spam1 . com
The idea is to produce a result like this:
Dear Arezzo,
Please check out my website at [LINK REMOVED] or [LINK REMOVED]
or [LINK REMOVED] or [LINK REMOVED] to win millions of dollars in prizes.
Thank you.
Big Spammer
As I remarked in the comments below, I probably don't need to ban the whole string spam1 dot com. If I can remove just the spam1 part, so that it becomes [LINK REMOVED] dot com, that would do the job.
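A minimal sketch of that name-only idea, assuming the banned list holds bare names such as spam1 rather than full domains. The class name NameOnlyFilter and the method scrubNames are hypothetical, chosen only for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class NameOnlyFilter {
    // Hypothetical helper: replaces only the banned name itself,
    // leaving whatever surrounds it (" dot com", "[.com]", ...) in place.
    static String scrubNames(String text, List<String> bannedNames) {
        // Quote each name so characters like '.' are matched literally.
        String alternation = bannedNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        // \b keeps "spam1" from matching inside a longer word such as "spam10".
        Pattern pattern = Pattern.compile("\\b(?:" + alternation + ")\\b",
                Pattern.CASE_INSENSITIVE);
        return pattern.matcher(text).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> bannedNames = Arrays.asList("spam1", "spam2", "spam3");
        System.out.println(scrubNames("Visit spam1 dot com or spam1[.com]", bannedNames));
    }
}
```

The trade-off is that fragments such as http://www. and .com stay in the text, so the result is unusable as a link but not especially tidy.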
Here's a start for you.
import java.util.*;
import java.util.regex.Pattern;

class Test {
    public static void main(String[] args) {
        String input = "Dear Arezzo,\n"
                + "Please check out my website at spam1.com "
                + "or http://www.spam1.com or http://spam1.com or "
                + "spam1 dot com to win millions of dollars in prizes.\n"
                + "Thank you.";
        // Bare site names only; the TLD is matched by the regex below
        List<String> bannedSites = Arrays.asList("spam1", "spam2", "spam3");
        StringBuilder re = new StringBuilder();
        for (String bannedSite : bannedSites) {
            if (re.length() > 0)
                re.append("|");
            String quotedSite = Pattern.quote(bannedSite);
            // URL form: http(s)://[www.]site followed by the rest of the URL
            re.append("https?://(www\\.)?" + quotedSite + "\\S*");
            // Text form: site, optional "dot" or ".", then a known TLD
            re.append("|" + quotedSite + "\\s*(dot|\\.)?\\s*(com|net|org)");
            //re.append("|" ... your variation here);
        }
        System.out.println(input.replaceAll(re.toString(), "LINK REMOVED"));
    }
}
Output:
Dear Arezzo,
Please check out my website at LINK REMOVED or LINK REMOVED or LINK REMOVED or LINK REMOVED to win millions of dollars in prizes.
Thank you.
Extend the regular expression as needed.
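As one possible extension, here is a sketch that also covers the bracketed and spaced forms from the question (spam1[.com], spam1 .com, spam1 . com). The class ObfuscatedLinkFilter, the method scrub, and the exact set of separators are assumptions for illustration; the TLD list is the one used above:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class ObfuscatedLinkFilter {
    // Assumed separator forms: "." or "dot", optionally wrapped in [ ] or ( ),
    // with optional whitespace on either side.
    private static final String SEPARATOR =
            "\\s*[\\[(]?\\s*(?:\\.|dot)\\s*[\\])]?\\s*";
    // Known TLD, a word boundary, then an optional closing bracket ("spam1[.com]").
    private static final String TLD = "(?:com|net|org)\\b[\\])]?";

    static Pattern buildPattern(List<String> bannedNames) {
        String names = bannedNames.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        String regex = "(?:https?://)?(?:www\\.)?"   // optional scheme and www.
                + "\\b(?:" + names + ")"              // one of the banned names
                + SEPARATOR + TLD
                + "(?:/\\S*)?";                       // optional URL path
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
    }

    // Hypothetical helper: replaces every recognised form with a placeholder.
    static String scrub(String text, List<String> bannedNames) {
        return buildPattern(bannedNames).matcher(text).replaceAll("[LINK REMOVED]");
    }

    public static void main(String[] args) {
        List<String> bannedNames = Arrays.asList("spam1", "spam2", "spam3");
        String input = "Try spam1 dot com, spam1[.com], spam1 .com or spam1 . com today.";
        System.out.println(scrub(input, bannedNames));
    }
}
```

Matching the path only when a slash follows the TLD keeps sentence punctuation, such as a trailing comma or period, out of the replaced span. Determined spammers will find forms this does not cover, so the separator and TLD lists are best treated as starting points.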