I want to remove the <DATE> tags together with the content between them, for example:
<DATE> html content </DATE>
These are the different versions of the code I have tried; none of them worked:
content = content.replaceAll("<DATE>(?s:)</DATE>", "");
content = content.replaceAll("<DATE>(?:.|\n)</DATE>", "");
content = content.replaceAll("<DATE>" + Pattern.DOTALL + "</DATE>", "");
content = content.replaceAll("<DATE>(.*?)</DATE>", "");
Any suggestions?
Complete Code:
Path corpusPath = Paths.get(path + file);
String content = new String(Files.readAllBytes(corpusPath), charset);
content = content.replaceAll("<HEADLINE>", "<DOCHDR>");
content = content.replaceAll("</HEADLINE>", "</DOCHDR>");
content = content.replaceAll("<DATE>(.*?)</DATE>", "");
Path destPath = Paths.get(path + "Parsed\\" +file);
Files.write(destPath, content.getBytes(charset));
Try the regex below to remove the <DATE> tag along with its content:
content = content.replaceAll("(?s)<DATE>.*?</DATE>", "");
Explanation:
(?s) enables DOTALL mode, which makes the dot match newline characters as well.
<DATE> matches the opening <DATE> tag.
.*? matches any characters up to the next </DATE>. The ? after * makes the quantifier lazy, so the regex engine takes the shortest match.
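To see the effect of the flag, here is a minimal sketch. The class name DateTagStripper, its method names, and the sample input are made up for illustration and are not part of the original code. It shows the inline (?s) form from the answer next to what should be an equivalent precompiled pattern using Pattern.DOTALL.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; not part of the original code.
class DateTagStripper {

    // Same pattern as the inline version, with DOTALL passed as a compile flag
    // instead of the embedded (?s) modifier.
    private static final Pattern DATE_BLOCK =
            Pattern.compile("<DATE>.*?</DATE>", Pattern.DOTALL);

    // Inline-flag form, as suggested in the answer.
    static String stripInline(String content) {
        return content.replaceAll("(?s)<DATE>.*?</DATE>", "");
    }

    // Precompiled form; useful when the same pattern is applied to many files.
    static String stripCompiled(String content) {
        return DATE_BLOCK.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed sample input: the <DATE> block spans several lines.
        String sample = "<DOC>\n<DATE>\n1 January\n</DATE>\n<TEXT>body</TEXT>\n</DOC>";

        // Without DOTALL the dot stops at the first newline, so nothing is removed.
        System.out.println(sample.replaceAll("<DATE>(.*?)</DATE>", "").equals(sample));

        // With DOTALL the whole block, newlines included, is removed.
        System.out.println(stripInline(sample));
        System.out.println(stripCompiled(sample));
    }
}
```

Note that Pattern.DOTALL is an int flag, so concatenating it into the pattern string (as in one of the attempts in the question) inserts its numeric value into the regex rather than enabling the mode. It has to be passed to Pattern.compile, or written as (?s) inside the pattern.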