I am trying to parse some HTML in order to extract all the links in it. To avoid dead links, I remove the commented-out code that begins with <!-- and ends with -->. Here comes the problem: in the HTML I may find some JavaScript code, for example:
<html>
<HEAD>
<SCRIPT LANGUAGE="JavaScript">
<!-- Begin
if (document.images) {
var pic2 = new Image(); // for the inactive image
pic2.src = "pic2.jpg";
var title2 = new Image();
title2.src = "title2.jpg";
}
...
-->
The weird thing is that the JS code is commented out but it still works. So, if I remove that code, the page won't behave as expected. What should I do in order to identify when I'm facing unused commented code and when that commented code is functional?
Single Line Comments
Single line JavaScript comments start with two forward slashes (//). All text after the two forward slashes until the end of the line makes up the comment, even when there are forward slashes in the commented text.
To create a single line comment in JavaScript, begin the line with two forward slashes ( // ). Here's an example of that: // This text is a comment and will be ignored! You can also add a single line comment on the same line as some code.
In HTML, a comment is text enclosed between <!-- and -->. This syntax tells the browser that the enclosed text is a comment and should not be rendered on the front end.
Creating Single Line Comments
To create a single line comment in JavaScript, you place two slashes "//" in front of the code or text you wish to have the JavaScript interpreter ignore. All text to the right of these two slashes is ignored until the end of the line.
the weird thing is that the js code is commented but it still works
Those aren't comments. It is just syntax allowed inside script (and style) elements that mimics the comment syntax, so that browsers which predate script and style don't render the code as text.
What should I do in order to identify when I'm facing unused commented code and when that commented code is functional?
Write a real HTML parser, following the parsing specification, and then remove any comment nodes from the generated DOM.
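If writing a parser from scratch is not an option, an existing one can do the same job. Here is a minimal sketch in Python, assuming the standard library's html.parser module is acceptable; the names LinkExtractor and extract_links are made up for illustration. It is event-based rather than DOM-based, so instead of deleting comment nodes it simply never looks inside them: real comments are reported separately from tags, and the contents of script and style elements are treated as raw text, so a <!-- inside a script is not mistaken for a comment and the script is left alone.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values of <a> tags that are part of the live markup.

    Hypothetical helper for illustration. Links inside real HTML comments
    are never reported as start tags, and script/style contents are
    handed over as plain data, so nothing in them is parsed as markup.
    """

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href of every <a> element outside of comments."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_links on a page that has a link inside <!-- ... --> and a script wrapped in <!-- Begin ... --> should return only the links that a browser would actually show.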
As a dirty (but possibly quick) solution, you could just ignore comments inside elements marked as containing CDATA in the HTML 4.01 DTD.
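The quick approach could look roughly like this in Python. It is a sketch under the assumption that the CDATA elements in question are script and style, and the function name is invented for illustration. A single regular expression matches either a whole script/style element, which is kept untouched, or an HTML comment, which is dropped. It will misbehave on unusual input (for example a "</script>" string inside the script itself), which is why it is only the dirty option.

```python
import re

# Either a complete <script>/<style> element (group 1, kept as-is)
# or an HTML comment (dropped). The scan runs left to right, so a
# "<!--" inside a script is swallowed by the first alternative and a
# <script> inside a real comment is swallowed by the second.
_SCRIPT_STYLE_OR_COMMENT = re.compile(
    r"(<(script|style)\b[^>]*>.*?</\2\s*>)|(<!--.*?-->)",
    re.IGNORECASE | re.DOTALL,
)


def strip_comments_outside_cdata(html):
    """Remove HTML comments, except text inside script/style elements."""
    return _SCRIPT_STYLE_OR_COMMENT.sub(lambda m: m.group(1) or "", html)
```

The cleaned string can then be passed to whatever link extraction is already in place.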