
What is the regex expression for CDATA

Hi, I have an example CDATA section here:

<![CDATA[asd[f]]]>

and

<tag1><![CDATA[asd[f]]]></tag1><tag2><![CDATA[asd[f]]]></tag2>

The CDATA regex I have is not able to recognize these:

"<![CDATA["([^\]]|"]"[^\]]|"]]"[^>])*"]]>"

This does not work either:

"<![CDATA["[^\]]*[\]]{2,}([^\]>][^\]]*[\]]{2,})*">"

Will someone please give me a regex that matches <![CDATA[asd[f]]]>? I need to use it in Lex/Flex.

Edit: I have since answered this question myself.

Freddy Chua asked Jan 06 '11 15:01




2 Answers

Easy enough, it should be this:

<!\[CDATA\[.*?\]\]>

At least it works on regexpal.com.

Sean Patrick Floyd answered Sep 28 '22 08:09


The problem is that this is rather awkward to match with the sort of regular expressions used in lex. If you had a system that supported Perl-style regular expressions (non-greedy quantifiers and lookahead are not part of POSIX EREs), then you'd be able to use either:

<!\[CDATA\[(.*?)\]\]>

or

<!\[CDATA\[((?:[^]]|\](?!\]>))*)\]\]>

(The first uses non-greedy quantifiers, the second uses negative lookahead constraints. OK, it uses non-capturing parens too, but you can use capturing ones there instead; that's not so important.)
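As a quick check, the two patterns above can be tried against the examples from the question. This is only a sketch in Python, whose `re` module happens to support both non-greedy quantifiers and lookahead; the names `NON_GREEDY`, `LOOKAHEAD` and `samples` are made up for the illustration, and the `re.DOTALL` flag is an addition so that `.` can cross line breaks inside a CDATA section.

```python
import re

# The two patterns from the answer, written as Python raw strings.
# re.DOTALL is an assumption added for this sketch: it lets "." match
# newlines, since CDATA content may span several lines.
NON_GREEDY = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
LOOKAHEAD = re.compile(r"<!\[CDATA\[((?:[^]]|\](?!\]>))*)\]\]>")

samples = [
    "<![CDATA[asd[f]]]>",
    "<tag1><![CDATA[asd[f]]]></tag1><tag2><![CDATA[asd[f]]]></tag2>",
]

for text in samples:
    # findall returns the captured group, i.e. the CDATA content.
    print(NON_GREEDY.findall(text), LOOKAHEAD.findall(text))
```

Both patterns capture asd[f] once from the first sample and twice from the second. Neither will work unchanged in lex or flex, which is the point of the rest of this answer.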

It's probably easier to handle this by using a similar strategy to the way C-style comments are handled in lex, by having one rule to match the start of the CDATA (on <![CDATA[) and put the lexer into a separate state that it leaves on seeing ]]>, while collecting all the characters in-between. This is instructive on the topic (and it seems that this is an area where flex and lex differ) and it covers all the strategies that you can take to make this work.
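The separate-state idea can be sketched outside of lex as well. The following is a hand-written illustration in Python, not flex code: the function name `scan_cdata` and the flag `in_cdata` are hypothetical, with the flag standing in for the lexer state that is entered on <![CDATA[ and left on ]]>.

```python
def scan_cdata(text):
    """Return the content of every complete CDATA section in text."""
    start, end = "<![CDATA[", "]]>"
    sections = []
    buffer = []        # characters collected while inside a section
    in_cdata = False   # plays the role of the lexer's separate state
    i = 0
    while i < len(text):
        if not in_cdata:
            if text.startswith(start, i):
                # Rule 1: the opening marker switches into the CDATA state.
                in_cdata = True
                buffer = []
                i += len(start)
            else:
                # Ordinary input; a real lexer would tokenize it here.
                i += 1
        elif text.startswith(end, i):
            # Rule 2: the closing marker leaves the state and emits the text.
            sections.append("".join(buffer))
            in_cdata = False
            i += len(end)
        else:
            # Rule 3: any other character is simply collected.
            buffer.append(text[i])
            i += 1
    return sections
```

For example, `scan_cdata("<![CDATA[asd[f]]]>")` gives `["asd[f]"]`. An unterminated section is silently dropped in this sketch; a real lexer would probably want to report it as an error.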

Note that the cause of all these problems is that it's very difficult to write a rule with simple regular expressions that expresses the fact that a greedy regular expression must only match a ] if it is not followed by ]>. It's much easier to do if you've only got a two-character (or single-character!) end-of-interesting-section sequence, because you don't need such an elaborate state machine.

Donal Fellows answered Sep 28 '22 08:09
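To give an idea of how elaborate such a rule gets, here is one pattern that avoids both non-greedy quantifiers and lookahead, using only character classes, grouping, alternation, * and +. It is a sketch checked with Python's `re` module, not a tested flex rule, so the quoting would still need adapting to lex syntax; the names `CDATA` and `matches` are made up for the example. It resembles the first pattern in the question, but it allows a run of more than two ] before the closing >.

```python
import re

# Body: any mix of
#   - a character that is not "]"
#   - a single "]" followed by a character that is not "]"
#   - two or more "]" followed by a character that is neither "]" nor ">"
# so the body can never step over a "]]>" terminator.
# End: two or more "]" followed by ">".
CDATA = re.compile(
    r"<!\[CDATA\["
    r"([^\]]|\][^\]]|\]\]+[^\]>])*"
    r"\]\]+>"
)

def matches(text):
    """Return every full CDATA section found in text, markers included."""
    return [m.group(0) for m in CDATA.finditer(text)]

print(matches("<![CDATA[asd[f]]]>"))
print(matches("<tag1><![CDATA[asd[f]]]></tag1><tag2><![CDATA[asd[f]]]></tag2>"))
```

Each section is matched separately, and a match stops at the first ]]> it meets, so trailing text such as b]]> after a closed section is left alone.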