
Regular expression to select a particular content, provided it is not enclosed in comments

Tags: regex, sql, oracle

I am looking for a regular expression which matches the pattern src="*.js", but this should not be enclosed in a comment.

Consider the following:

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

Extended sample input, described by OP as "correct":

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

The result should not match lines 1 and 2 (where the content is enclosed in a comment). It should only match lines 3 and 4 (for the extended sample input: line 3 through the end, except the line that only closes a comment).

So far I have this regexp which selects all my .js files but also the ones that are commented out: (src=\")+(\S)+(.js)

I am looking for a regex which only selects the script tags with a .js src attribute that are not surrounded by a comment.

I would also like to mention that I am using this regular expression in an Oracle PL/SQL query.

Umair Tarafdar asked Mar 14 '18 09:03


3 Answers

I don't know if you can do what you want with a single regular expression, especially since Oracle's implementation of regular expressions does not support lookaround. But there are some things you can do with SQL to get around these limitations. The following extracts the matches in two steps: it first removes the comments from the text, then matches the pattern src=".*\.js" in what remains. Multiple results are retrieved using CONNECT BY:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
        <!----<script type="text/javascript" src="js/Shop.js"></script>  -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
        -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script>
        <script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
          FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR html_id = html_id
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

If the HTML is stored in a table somewhere, then you would do the following:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM mytable
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src=".*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR html_id = html_id
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;

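It may seem strange, but the final two lines are necessary to avoid duplicate results: PRIOR html_id = html_id keeps each row's matches within that row's own hierarchy, and PRIOR DBMS_RANDOM.VALUE IS NOT NULL keeps Oracle from reporting a CONNECT BY loop.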

Results as follows:

| HTML_ID | MATCH                              |
+---------+------------------------------------+
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.serialize-object.js"   |
|       1 | src="jquery.cookie.js"             |
+---------+------------------------------------+
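
The two-step idea (remove the comments first, then match in what remains) is not specific to Oracle. As a rough sketch only, here is the same logic in Python rather than SQL; the variable names are made up for illustration, and Python's re module is not identical to Oracle's regex flavor:

```python
import re

# Sample input from the question (indentation omitted).
html_text = """<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>"""

# Step 1: drop every comment. re.DOTALL lets "." cross line breaks,
# which is the role the 'n' match parameter plays in REGEXP_REPLACE.
clean_html = re.sub(r'<!--.*?-->', '', html_text, flags=re.DOTALL)

# Step 2: collect every match in what remains. Without DOTALL, "." stops
# at a line break, so the greedy ".*" cannot run past the end of a line.
matches = re.findall(r'src=".*\.js"', clean_html, flags=re.IGNORECASE)

for match in matches:
    print(match)
```

This prints the same five values as the result table above: four times src="jquery.serialize-object.js" and once src="jquery.cookie.js".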

SQL Fiddle HERE.

Hope this helps.

EDIT: Edited according to my comment below:

SELECT html_id, REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') AS match
  FROM (
    SELECT html_id, REGEXP_REPLACE(html_text, '<!--.*?-->', '', 1, 0, 'n') AS clean_html
      FROM (
        SELECT 1 AS html_id, '<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
        <!----<script type="text/javascript" src="js/Shop.js"></script>  -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
        -- afterwards -->
        <script type="text/javascript" src="jquery.serialize-object.js"></script>
        <script type="text/javascript" src="jquery.cookie.js"></script>' AS html_text
          FROM dual
    )
)
CONNECT BY REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i') IS NOT NULL
   AND PRIOR html_id = html_id
   AND PRIOR DBMS_RANDOM.VALUE IS NOT NULL;
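
One situation where the [^"]* version behaves differently from the .* version is when two script tags share a line. The input below is a hypothetical example, not taken from the question, and the sketch uses Python instead of Oracle SQL purely to show the effect of the two patterns:

```python
import re

# Hypothetical input: two script tags on one line.
line = '<script src="a.js"></script><script src="b.js"></script>'

# Greedy ".*" runs to the LAST '.js"' on the line, so both tags
# are swallowed into a single match.
greedy = re.findall(r'src=".*\.js"', line)

# '[^"]*' cannot cross a double quote, so each src attribute
# is matched on its own.
bounded = re.findall(r'src="[^"]*\.js"', line)

print(greedy)   # ['src="a.js"></script><script src="b.js"']
print(bounded)  # ['src="a.js"', 'src="b.js"']
```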

EDITED

If you're searching a CLOB rather than a CHAR column, REGEXP_SUBSTR() will return a CLOB, and the IS NOT NULL comparison just takes forever in that case. The first line of the CONNECT BY clause should then look like this:

CONNECT BY DBMS_LOB.SUBSTR(REGEXP_SUBSTR(clean_html, 'src="[^"]*\.js"', 1, LEVEL, 'i'), 4000, 1) IS NOT NULL

Hope this helps.

David Faber answered Oct 12 '22 23:10


Take, for example, this sample input:

<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

This regex: src="[^"]*\.js\"></script>(\s*<!--[^>]*-->)*(\s*<!--[^>]*)?$
will give you this output:

<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>

I tested with GNU grep 2.5.4, hoping that it gets close enough to your regex flavor. The regex is very light on special features.

Explanation:

  • "[^"]* matches an opening double quote followed by anything up to the next double quote
  • (<!--[^>]*-->)* matches any number of complete comments, provided they do not contain >
  • (<!--[^>]*)?$ matches an optional comment that starts at the end of a line and contains no >
  • \s* allows optional white space
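
To see the line-by-line behavior outside grep, the regex can be applied to each line separately. This is only a sketch in Python, whose regex flavor differs from both GNU grep and Oracle; the variable names are illustrative:

```python
import re

# Sample input from above, one list entry per line.
lines = """<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>""".splitlines()

pattern = re.compile(
    r'src="[^"]*\.js"></script>(\s*<!--[^>]*-->)*(\s*<!--[^>]*)?$'
)

# Like grep, test every line on its own and keep the line numbers that match.
matched = [n for n, line in enumerate(lines, start=1) if pattern.search(line)]

print(matched)  # [3, 4, 5, 7, 8]
```

Lines 1 and 2 are rejected because only "  -->" follows the closing script tag, which is neither a complete comment nor the start of one.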

Note that at some level of complexity in the relevant input, regexes stop being the right tool. Beyond that point, a dedicated tool, i.e. a parser for XML/HTML or whatever the format is, is the better choice.
For me that point is reached when the relevant input may be "hidden" inside a multiline comment. I feel that you turned the question into a moving target: you first confirmed that relevant input can be expected on a single line (apart from a comment starting afterwards), but then changed the rules by adding contradicting sample input. At one point you did describe the sample input I proposed as "correct".
The (very funny) XML/regex Q&A linked in the comments demonstrates the hell you can end up in if you do not draw the line early enough.
When you are restricted to a given environment, e.g. an SQL server, you should leverage the special abilities of that environment. It is surely possible to process the non-commented parts of the input with SQL mechanisms to reach a goal that lies a few steps further on. That is, drop your immediate idea of how to proceed and take a little detour in your thinking. Try to make sure that you do not exhaust yourself on an XY problem.

Yunnosch answered Oct 12 '22 23:10


I've put a negative look-ahead before the end of your regex, but mind that if there's a commented part after the src it will likewise be ignored.

(src=\")+(\S)+(\.js\")+(?!.*-->)(.*)
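
The caveat is easy to see when the pattern is run line by line against the extended sample input. The sketch below uses Python, whose regex engine supports lookahead; it is meant as an illustration of the pattern's behavior, not as something Oracle can execute:

```python
import re

lines = """<!------<script type="text/javascript" src="js/Shop.js"></script>  -->
<!----<script type="text/javascript" src="js/Shop.js"></script>  -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!---->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment -- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script><!-- a comment starting but not ending
-- afterwards -->
<script type="text/javascript" src="jquery.serialize-object.js"></script>
<script type="text/javascript" src="jquery.cookie.js"></script>""".splitlines()

# The negative lookahead rejects a src attribute whenever "-->" appears
# anywhere later on the same line.
pattern = re.compile(r'(src=\")+(\S)+(\.js\")+(?!.*-->)(.*)')

matched = [n for n, line in enumerate(lines, start=1) if pattern.search(line)]

print(matched)  # [5, 7, 8]
```

Lines 3 and 4 drop out even though their script tags are live, because a complete comment follows the tag on the same line.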

Edit:

I managed something similar without the lookahead (which PL/SQL doesn't have):

(src=\")(\S)+(\.js\")[^(--)\n]+(\n|$)

Lance Toth answered Oct 13 '22 00:10