
Regular Expression to match every new line character (\n) inside a <content> tag

Tags:

regex

I'm looking for a regular expression to match every newline character (\n) inside an XML <content> tag, or inside any tag nested within that <content> tag. For example:

<blog>
<text>
(Do NOT match new lines here)
</text>
<content>
(DO match new lines here)
<p>
(Do match new lines here)
</p>
</content>
(Do NOT match new lines here)
<content>
(DO match new lines here)
</content>

251

asked Jul 13 '09 05:07

Moayad Mardini

1 Answer

Actually... you can't use a simple regex here, at least not a single one. You probably need to worry about comments! Someone may write:

<!-- <content> blah </content> -->

You can take two approaches here:

1. Strip all comments out first. Then use the regex approach.
2. Do not use regular expressions; use a context-sensitive parsing approach that can keep track of whether or not you are nested in a comment.

Be careful.

I am also not so sure you can match all new lines at once. @Quartz suggested this one:

<content>([^\n]*\n+)+</content>

This will match any content tags that have a newline character RIGHT BEFORE the closing tag... but I'm not sure what you mean by matching all newlines. Do you want to be able to access all the matched newline characters? If so, your best bet is to grab all content tags, and then search for all the newline chars that are nested in between. Something more like this:

<content>.*</content>

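BUT THERE IS ONE CAVEAT: the * quantifier is greedy by default, so this regex will match from the first opening tag to the last closing one. Instead, you HAVE to make the quantifier non-greedy. In languages like Python, you do this by appending "?" to it (.*?). Note also that "." does not match a newline unless you turn on the dot-all flag (re.DOTALL in Python), so without that flag the pattern cannot span lines at all.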
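To make the greedy versus non-greedy difference concrete, here is a small sketch. The sample string is made up for illustration; the only thing that changes between the two calls is the "?" after ".*".

```python
import re

# Hypothetical sample with two separate <content> elements.
xml_text = "<content>a\nb</content><text>skip\n</text><content>c\n</content>"

# Greedy: .* runs from the first <content> to the LAST </content>.
greedy = re.findall(r"<content>.*</content>", xml_text, re.DOTALL)

# Non-greedy: .*? stops at the NEAREST </content>.
lazy = re.findall(r"<content>.*?</content>", xml_text, re.DOTALL)

print(len(greedy))  # 1
print(lazy)         # ['<content>a\nb</content>', '<content>c\n</content>']
```

The greedy version swallows the <text> element in the middle, which is exactly the part that should not be searched.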

I hope with this you can see some of the pitfalls and figure out how you want to proceed. You are probably better off using an XML parsing library, then iterating over all the content tags.
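For the parsing-library route, a minimal sketch using Python's standard xml.etree.ElementTree might look like the following. It assumes the document is well-formed (the sample below is a made-up, closed version of the question's example), and newline_count is just an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed variant of the example from the question.
xml_text = """<blog>
<text>
no match here
</text>
<content>
match here
<p>
and here
</p>
</content>
<!-- <content>
ignored
</content> -->
</blog>"""

root = ET.fromstring(xml_text)  # the default parser discards comments

def newline_count(element):
    # itertext() yields the text of the element and of everything nested in it
    return sum(chunk.count("\n") for chunk in element.itertext())

counts = [newline_count(el) for el in root.iter("content")]
print(counts)  # [5]
```

The commented-out <content> never shows up, because the parser has already dealt with comments before you see the tree.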

I know I may not be offering the best solution, but at least I hope you will see the difficulty in this and why other answers may not be right...

UPDATE 1:

Let me summarize a bit more and add some more detail to my response. I am going to use python's regex syntax because it is what I am more used to (forgive me ahead of time... you may need to escape some characters... comment on my post and I will correct it):

To strip out comments, use this regex: <!--.*?--> Notice the "?" after the .* makes it non-greedy.

Similarly, to search for content tags, use: <content>.*?</content>
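A minimal sketch of those two steps in Python, assuming the patterns <!--.*?--> and <content>(.*?)</content> with re.DOTALL so that "." can cross line breaks. The sample string is invented for the demonstration.

```python
import re

# Hypothetical sample: one commented-out <content> and one live one.
xml_text = "<!-- <content>\nold\n</content> -->\n<content>\nnew\n</content>"

# Step 1: remove comments (flags must be passed by keyword to re.sub).
without_comments = re.sub(r"<!--.*?-->", "", xml_text, flags=re.DOTALL)

# Step 2: capture the body of each remaining <content> element.
bodies = re.findall(r"<content>(.*?)</content>", without_comments, flags=re.DOTALL)

print(bodies)  # ['\nnew\n']
```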

Also, you may be able to try this out and access each newline character with the match object's groups():

<content>(.*?(\n))+.*?</content>

I know my escaping is off, but it captures the idea. This last example probably won't work, but I think it's your best bet at expressing what you want. My suggestion remains: either grab all the content tags and do it yourself, or use a parsing library.
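One reason a single pattern struggles here: in Python's re module a capture group that sits inside a repetition only keeps its last iteration, so groups() cannot hand back every newline. The sketch below (sample string invented) shows that, along with the simpler "grab the content, then count" alternative.

```python
import re

# Hypothetical sample with three newlines inside <content>.
body = "<content>\nfirst\nsecond\n</content>"

m = re.search(r"<content>(.*?(\n))+.*?</content>", body, re.DOTALL)
# The group repeated three times, but only the final iteration is retained.
print(m.groups())  # ('second\n', '\n')

# Alternative: capture the whole body once, then count the newlines in it.
inner = re.search(r"<content>(.*?)</content>", body, re.DOTALL).group(1)
print(inner.count("\n"))  # 3
```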

UPDATE 2:

So here is Python code that ought to work. I am still unsure what you mean by "find" all newlines. Do you want the entire lines, or just a count of how many newlines there are? To get the actual lines, try:

#!/usr/bin/python

import re

def FindContentNewlines(xml_text):
    # May want to compile these regexes elsewhere, but I do it here for brevity
    comments = re.compile(r"<!--.*?-->", re.DOTALL)
    content = re.compile(r"<content>(.*?)</content>", re.DOTALL)
    # No DOTALL here: "." has to stop at the newline so each match is one line
    newlines = re.compile(r"^(.*?)$", re.MULTILINE)

    # Strip comments first. XML comments cannot nest: in <!--  <!-- --> -->
    # the first "-->" closes the comment, which is also where the
    # non-greedy match stops.
    xml_text = re.sub(comments, "", xml_text)

    result = []
    all_contents = re.findall(content, xml_text)
    for c in all_contents:
        result.extend(re.findall(newlines, c))

    return result

if __name__ == "__main__":
    example = """

<!-- This stuff ought to be omitted
<content>
  omitted
</content>
-->

This stuff is good
<content>
<p>
  haha!
</p>
</content>

This is not found
"""
    print(FindContentNewlines(example))

This program prints the result:

 ['', '<p>', '  haha!', '</p>', '']

The first and last empty strings come from the newline char immediately preceding the first <p> and the one coming right after the </p>. All in all this (for the most part) does the trick. Experiment with this code and refine it for your needs. Print out stuff in the middle so you can see what the regexes are matching and not matching.
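If what you are after is the newline characters themselves rather than the lines, a variation is to collect their positions in the original text. This is only a sketch under the same assumptions as above (no nested <content>, comments handled by regex); content_newline_offsets is a made-up helper name. Comments are blanked out with spaces of equal length instead of being deleted, so the offsets still point into the unmodified input.

```python
import re

def content_newline_offsets(xml_text):
    """Return the index in xml_text of every newline inside a <content> element."""
    # Overwrite comments with spaces so positions in the text do not shift.
    masked = re.sub(r"<!--.*?-->",
                    lambda m: " " * len(m.group(0)),
                    xml_text, flags=re.DOTALL)
    offsets = []
    for m in re.finditer(r"<content>(.*?)</content>", masked, flags=re.DOTALL):
        start = m.start(1)  # index where the element body begins
        offsets.extend(start + i
                       for i, ch in enumerate(m.group(1)) if ch == "\n")
    return offsets

sample = "a\n<content>\nx\n</content>\n"
print(content_newline_offsets(sample))  # [11, 13]
```

With the offsets in hand you can replace or remove exactly those characters and leave every other newline alone.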

Hope this helps :-).

PS - I didn't have much luck trying out my regex from my first update to capture all the newlines... let me know if you do.

130

answered Sep 24 '22 14:09

Tom
