
Finding comments in HTML

I have an HTML file that may also contain JavaScript, PHP, and whatever else people may or may not put into their HTML files.

I want to extract all comments from this HTML file.

I can point out two problems in doing this:

  1. What is a comment in one language may not be a comment in another.

  2. In JavaScript, the // marker comments out the remainder of a line. But URLs also contain //, so if I simply replace // and the rest of the line with nothing, I may well remove parts of URLs.

So this is not a trivial problem.

Is there an existing solution for this available anywhere?

Has anybody already done this?

john-jones asked Oct 19 '12 10:10


2 Answers

Problem 2: Isn't every URL quoted, as either "www.url.com" or 'www.url.com', when you write it in either language? I'm not sure. If that's the case, then all you have to do is parse the code and check whether a quote mark precedes the slashes to tell whether it's a real URL or just a comment.
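A minimal sketch of that quote-checking idea, in Python. The helper name find_line_comments is hypothetical, and the scanner only understands '...' and "..." strings; it assumes the input contains no regex literals, template literals, or /* ... */ comments, so it is an illustration rather than a complete solution.

```python
def find_line_comments(js_source: str) -> list:
    """Return each // comment that starts outside a quoted string.

    Hypothetical helper: regex literals, template literals and
    /* ... */ comments are NOT handled.
    """
    comments = []
    quote = None  # quote character of the string we are inside, if any
    i = 0
    n = len(js_source)
    while i < n:
        ch = js_source[i]
        if quote is not None:
            if ch == "\\":
                i += 2  # skip the backslash and the character it escapes
                continue
            if ch == quote or ch == "\n":
                quote = None  # string closed, or left unterminated at end of line
        elif ch in "'\"":
            quote = ch  # entering a quoted string
        elif js_source.startswith("//", i):
            end = js_source.find("\n", i)
            if end == -1:
                end = n
            comments.append(js_source[i:end])
            i = end
            continue
        i += 1
    return comments


sample = 'var url = "http://www.url.com"; // fetch later\nvar x = 1;'
print(find_line_comments(sample))
```

Here the // inside the quoted URL is skipped because the scanner is still inside a string when it reaches it, while the trailing // on the same line is reported.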

Swedish dude answered Sep 19 '22 01:09


Look into parser generators like ANTLR, which has grammars for many languages, and write a nesting parser to find comments reliably. Regular expressions aren't going to help you if accuracy is important. Even with a parser, it won't be 100% accurate.

Consider

Problem 3, text that looks like a comment in a language is not always a comment in that language.

<textarea><!-- not a comment --></textarea>
<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>
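To make this concrete, here is a small Python sketch that runs three deliberately naive patterns over the two lines above. The pattern names and the snippet variable are made up for illustration; the point is only that each pattern reports a match even though none of the matched text is a comment.

```python
import re

# Deliberately naive patterns of the kind that break on the examples above.
NAIVE_PATTERNS = {
    "html": re.compile(r"<!--.*?-->", re.DOTALL),
    "js_block": re.compile(r"/\*.*?\*/", re.DOTALL),
    "js_line": re.compile(r"//[^\n]*"),
}

snippet = (
    "<textarea><!-- not a comment --></textarea>\n"
    '<script>var re = /[/*]not a comment[*/]/, str = "//not a comment";</script>\n'
)

# Every entry here is a false positive: textarea content, part of a
# regex literal, and part of a string literal respectively.
false_positives = {
    name: pattern.findall(snippet) for name, pattern in NAIVE_PATTERNS.items()
}

for name, matches in false_positives.items():
    print(name, matches)
```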

Problem 4, a comment embedded in a language may not obviously be a comment.

<button onclick="&#47;&#47; this is a comment//&#10;notAComment()">
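A quick Python check of what that attribute value looks like once the character references are decoded, using the standard library's html.unescape. The variable names are illustrative only.

```python
import html

# The onclick attribute value from the example, as written in the HTML source.
attr_value = "&#47;&#47; this is a comment//&#10;notAComment()"

# Decode &#47; to "/" and &#10; to a newline, as an HTML parser would
# before handing the attribute value to the script engine.
decoded = html.unescape(attr_value)

# The decoded script has a // comment on its first line and a call on the second.
first_line, rest = decoded.split("\n", 1)
print(repr(first_line))
print(repr(rest))
```

So a tool that searches the raw HTML text for // at the start of a script would miss this comment, because the slashes only appear after decoding.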

Problem 5, what is a comment may depend on how the browser is configured.

<noscript><!-- </noscript> Whether this is a comment depends on whether JS is turned on -->
<!--[if IE 8]>This is a comment, except on IE 8<![endif]-->

I had to solve this problem partially for contextual templating systems that elide comments from source code to prevent leaking software implementation details.

https://github.com/mikesamuel/html-contextual-autoescaper-java/blob/master/src/tests/com/google/autoesc/HTMLEscapingWriterTest.java#L1146 shows a test case where a comment is identified in JavaScript, and later test cases show comments identified in CSS and HTML. You may be able to adapt that code to find comments. It will not handle comments in PHP code sections.

Mike Samuel answered Sep 19 '22 01:09