
Jsoup check if string is valid HTML

Tags: java, jsoup

I am having difficulties with the Jsoup parser. How can I tell whether a given string is valid HTML?

String input = "Your vote was successfully added.";
boolean isValid = Jsoup.isValid(input);
// isValid = true

The isValid flag is true because Jsoup first runs HtmlTreeBuilder: if any of the html, head or body tags is missing, it adds them itself. It then uses the Cleaner class to check the result against the given Whitelist.

Is there a simple way to check whether a string is valid HTML without Jsoup trying to turn it into HTML?

My example is an AJAX response that arrives with the "text/html" content type. It then goes to the parser, Jsoup adds these tags, and as a result the response is not displayed properly.

Thanks for your help.

user464592 asked Jan 20 '14 12:01


2 Answers

First of all, the solution proposed by Reuben does not work as expected. The pattern has to be compiled with the Pattern.DOTALL flag, because the input HTML may (and probably will) contain newline characters, and without that flag "." does not match line terminators.

So it should be something like this:

Pattern htmlPattern = Pattern.compile(".*\\<[^>]+>.*", Pattern.DOTALL);
boolean isHTML = htmlPattern.matcher(input).matches();
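To make the effect of the flag concrete, here is a minimal sketch that runs the same pattern with and without Pattern.DOTALL on a multi-line input. The class and method names are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: compares the pattern with and without DOTALL.
class DotAllDemo {
    // Same expression as in the answer, compiled without any flags.
    private static final Pattern WITHOUT_DOTALL =
            Pattern.compile(".*\\<[^>]+>.*");

    // Same expression, compiled with DOTALL so "." also matches line terminators.
    private static final Pattern WITH_DOTALL =
            Pattern.compile(".*\\<[^>]+>.*", Pattern.DOTALL);

    static boolean matchesWithoutDotAll(String input) {
        return WITHOUT_DOTALL.matcher(input).matches();
    }

    static boolean matchesWithDotAll(String input) {
        return WITH_DOTALL.matcher(input).matches();
    }

    public static void main(String[] args) {
        String multiLine = "<html>\n<body>Hi</body>\n</html>";
        // matches() must consume the whole input, and "." stops at '\n' by default.
        System.out.println(matchesWithoutDotAll(multiLine)); // false
        System.out.println(matchesWithDotAll(multiLine));    // true
    }
}
```

On single-line input both variants behave the same; the difference only shows up once the string contains a line break.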

I also think the pattern should look specifically for the <html> tag, not just any tag. Next: a bare <html> is not the only valid form. The tag may also carry attributes, and this has to be handled as well.

I chose to modify the Jsoup source. If HtmlTreeBuilder (actually the BeforeHtml state) tries to add the <html> element, I throw a ParseException, and then I am sure that the input was not a valid HTML file.
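A stricter check that looks only for an opening html tag, with or without attributes, could be sketched as follows. The pattern, the class name and the method name are assumptions made for illustration; this is not the approach the answer finally adopts, and it would still be fooled by an html tag that appears inside a comment or a script.

```java
import java.util.regex.Pattern;

// Hypothetical helper: detects an opening <html> tag, with or without attributes.
class HtmlRootTagCheck {
    // "<html", then optionally whitespace followed by attribute text, then ">".
    // CASE_INSENSITIVE also accepts "<HTML>"; [^>]* may span line breaks.
    private static final Pattern HTML_OPEN_TAG =
            Pattern.compile("<html(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    static boolean hasHtmlRootTag(String input) {
        // find() searches anywhere in the input, so a leading doctype is fine.
        return input != null && HTML_OPEN_TAG.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasHtmlRootTag("<html><body></body></html>"));        // true
        System.out.println(hasHtmlRootTag("<html lang=\"en\"><body></body>"));   // true
        System.out.println(hasHtmlRootTag("<b>bold</b>"));                       // false
        System.out.println(hasHtmlRootTag("Your vote was successfully added.")); // false
    }
}
```

Because the check uses find() rather than matches(), it does not need Pattern.DOTALL.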

user464592 answered Sep 29 '22 20:09


Use a regex to check whether the String contains HTML:

boolean isHTML = input.matches(".*\\<[^>]+>.*");

If your String contains HTML, it will return true:

String input = "<html><body></body></html>";

But with String input = "Hello World <>"; it will return false, because the pattern requires at least one character between "<" and ">".
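The behaviour described above can be checked with a small sketch. The class and method names are hypothetical; the expression is the one from this answer. The last input is an added illustration of a limitation: the pattern only looks for a "<...>" pair, so plain text containing both characters also matches.

```java
// Hypothetical demo class wrapping the one-line check from this answer.
class SimpleTagRegexDemo {
    static boolean looksLikeHtml(String input) {
        // String.matches() requires the whole input to match the expression.
        return input.matches(".*\\<[^>]+>.*");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtml("<html><body></body></html>"));        // true
        System.out.println(looksLikeHtml("Hello World <>"));                    // false
        System.out.println(looksLikeHtml("Your vote was successfully added.")); // false
        // Not HTML, but "< b and c >" satisfies the pattern.
        System.out.println(looksLikeHtml("a < b and c > d"));                   // true
    }
}
```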

Reuben answered Sep 29 '22 20:09