 

Regular expression for getting text between XML elements

Tags:

java

regex

xml

I am looking at this regular expression:

<(\\w*)>\\.*</(\\w*)>

Going through tutorials, I understand it as: match anything of the form

<tag1>blah</tag1>

i.e. an opening XML tag, some text and a closing XML tag. However, when I run it in various regular expression checkers, for example Expresso, it does not match what I think it should.

Note: to complicate matters further, this regular expression comes from Java code, which as I understand it means there are some subtle differences.

What am I missing?

Anything appreciated...

Thanks

dublintech asked Mar 06 '26 01:03


2 Answers

Use:

<(\w*)>.*</(\w*)>

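\\w – a literal \ followed by w
\\ – a literal \
\\.* – a literal \ followed by any characters when pasted into a tester; inside a Java string literal it becomes \.*, which is zero or more literal dots, so use .* to match the text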
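
For the Java side of the question, here is a minimal sketch of the same pattern written as a string literal, where each regex backslash has to be doubled. The class name, method name and sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagRegexDemo {
    // The Java literal "\\w" is the two-character regex \w (a word character),
    // so the regex engine sees: <(\w*)>.*</(\w*)>
    static final Pattern ELEMENT = Pattern.compile("<(\\w*)>.*</(\\w*)>");

    // Returns the opening tag name if the whole input has the form <tag>text</tag>,
    // otherwise null. Note that this pattern does not check that both tag names agree.
    static String openingTag(String input) {
        Matcher m = ELEMENT.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(openingTag("<tag1>blah</tag1>"));
        System.out.println(openingTag("no tags here"));
    }
}
```

In a tester such as Expresso you would paste the regex with single backslashes, since there is no string literal layer in between.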

Kirill Polishchuk answered Mar 08 '26 13:03


In a regex, escaping is only needed to match metacharacters literally. But some languages, Java included, also use \ as the escape character inside string literals, so you have to write \\ in the string to get a single \ in the regex. A regex \\ (a literal backslash) then becomes \\\\ in the source code. I think this is the cause of the confusion when you see \\ in example code.
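
To make the two escaping layers concrete, here is a small sketch in Java. The class and method names are hypothetical; the last method uses the pattern from the question exactly as it would appear in Java source.

```java
import java.util.regex.Pattern;

public class EscapeLevelsDemo {
    // Java literal "\\w" -> regex \w -> exactly one word character.
    static boolean wordChar(String s) {
        return Pattern.matches("\\w", s);
    }

    // Java literal "\\\\w" -> regex \\w -> a literal backslash followed by the letter w.
    static boolean backslashThenW(String s) {
        return Pattern.matches("\\\\w", s);
    }

    // The question's pattern as a Java literal. Here "\\.*" becomes the regex \.*,
    // which means zero or more literal dots, not "any text".
    static boolean questionPattern(String s) {
        return Pattern.matches("<(\\w*)>\\.*</(\\w*)>", s);
    }

    public static void main(String[] args) {
        System.out.println(wordChar("a"));
        System.out.println(backslashThenW("\\w"));
        System.out.println(questionPattern("<tag1>blah</tag1>"));
        System.out.println(questionPattern("<tag1>...</tag1>"));
    }
}
```

So even inside Java, the pattern from the question only accepts dots between the tags, which is why the unescaped .* is needed.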

Improving the regex:

If someone wanted to be difficult and write irregular markup like:

< _some_tag some="stuff" >
    some <strong>content</strong>
< / _some_tag >

You can use this more generic regex, which captures the tag name, the attributes and the content (in that order).

<\s*([A-Za-z_]\w*)\s*([^>]*)>(.*?)<\s*\/\s*\1\s*>

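Note that the lazy .*? is needed in case the same tag appears again further down the page; a greedy .* would capture everything up to the last closing tag with that name. Also, . does not match line breaks by default, so multi-line content needs the dot-all flag (Pattern.DOTALL in Java). Because the closing tag uses the back-reference \1, mismatched input such as <tag1>blah</tag2> is rejected, which is usually what you want; if you need it to be more flexible, replace \1 in the last part of the regex with another tag-name pattern.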
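
Here is a rough Java sketch of the generic regex and of the lazy versus greedy difference. It makes a few assumptions that are mine: the closing tag uses \s* on both sides of the slash, the attribute group is [^>]* so that tags without attributes also match, and Pattern.DOTALL is set so that . can cross line breaks. The class and method names are made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenericTagDemo {
    // Opening tag: group 1 = tag name, group 2 = raw attribute text (may be empty).
    private static final String HEAD = "<\\s*([A-Za-z_]\\w*)\\s*([^>]*)>";
    // Closing tag: \1 forces the same name as the opening tag.
    private static final String TAIL = "<\\s*/\\s*\\1\\s*>";

    // Group 3 is the content; only the quantifier differs between the two patterns.
    static final Pattern LAZY = Pattern.compile(HEAD + "(.*?)" + TAIL, Pattern.DOTALL);
    static final Pattern GREEDY = Pattern.compile(HEAD + "(.*)" + TAIL, Pattern.DOTALL);

    // Returns {name, attributes, content} for the first match, trimmed, or null if none.
    static String[] firstElement(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim(), m.group(3).trim() };
    }

    public static void main(String[] args) {
        String messy = "< _some_tag some=\"stuff\" >\n"
                + "    some <strong>content</strong>\n"
                + "< / _some_tag >";
        System.out.println(String.join(" | ", firstElement(LAZY, messy)));

        String twice = "<b>one</b> and <b>two</b>";
        System.out.println(firstElement(LAZY, twice)[2]);   // content of the first <b> only
        System.out.println(firstElement(GREEDY, twice)[2]); // runs on to the last </b>
    }
}
```

This is still only a pattern match, not a parser: nested elements with the same name, comments and CDATA sections will confuse it, so an XML parser is the safer choice for real documents.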

Aram Kocharyan answered Mar 08 '26 15:03