
Regex matching between two strings?

I can't seem to find a way to extract all comments, as in the following example.

>>> import re
>>> string = '''
... <!-- one 
... -->
... <!-- two -- -- -->
... <!-- three -->
... '''
>>> m = re.findall ( '<!--([^\(-->)]+)-->', string, re.MULTILINE)
>>> m
[' one \n', ' three ']

The block with two -- -- is not matched, most likely because of a bad regex. Can someone please point me in the right direction on how to extract the matches between two strings?


Hi, I've tested what you suggested in the comments. Here is a working solution with a small upgrade.

>>> m = re.findall ( '<!--(.*?)-->', string, re.MULTILINE)
>>> m
[' two -- -- ', ' three ']
>>> m = re.findall ( '<!--(.*\n?)-->', string, re.MULTILINE)
>>> m
[' one \n', ' two -- -- ', ' three ']

thanks!

Hrvoje Špoljar asked Oct 04 '12 21:10



2 Answers

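This should do the trick. With re.DOTALL, . also matches newlines, so the lazy .*? can span a comment that breaks across lines: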

 m = re.findall ( '<!--(.*?)-->', string, re.DOTALL)
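
As a quick check of the difference between the two flags, here is a minimal sketch that runs the same pattern against the sample string from the question. re.MULTILINE only changes how ^ and $ behave, while re.DOTALL lets . match a newline. The variable names are chosen just for illustration.

```python
import re

# Sample text from the question, written with explicit newlines.
text = '\n<!-- one \n-->\n<!-- two -- -- -->\n<!-- three -->\n'

pattern = r'<!--(.*?)-->'

# MULTILINE affects only ^ and $, so '.' still stops at a newline
# and the comment that spans two lines is missed.
multiline_matches = re.findall(pattern, text, re.MULTILINE)

# DOTALL lets '.' match '\n', so the lazy '.*?' can cross line breaks.
dotall_matches = re.findall(pattern, text, re.DOTALL)

print(multiline_matches)  # [' two -- -- ', ' three ']
print(dotall_matches)     # [' one \n', ' two -- -- ', ' three ']
```

Unlike the `(.*\n?)` variant, which allows at most one line break at the end of the comment body, this should also handle comments that span three or more lines.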
iruvar answered Oct 18 '22 02:10



In general, it is impossible to do arbitrary matching between two delimiters with a regular grammar.

Specifically, if you allow nesting,

<!-- how do you deal <!-- with nested --> comments? -->

you'll run into issues. So, while you may be able to solve this specific problem with a regular expression, any regular expression you write can be broken by some other strange nesting of comments.
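
To make that concrete, here is a small sketch of how the lazy and the greedy version of the pattern each break on a different input. The two test strings are made up for illustration (the first is the nested example above).

```python
import re

nested = '<!-- how do you deal <!-- with nested --> comments? -->'
siblings = '<!-- a --> x <!-- b -->'

# Lazy: stops at the first '-->', which closes the inner comment,
# so the outer comment is cut short.
lazy_on_nested = re.findall(r'<!--(.*?)-->', nested, re.DOTALL)

# Greedy: runs to the last '-->', which handles this one nested case
# but glues two separate comments together.
greedy_on_siblings = re.findall(r'<!--(.*)-->', siblings, re.DOTALL)

print(lazy_on_nested)      # [' how do you deal <!-- with nested ']
print(greedy_on_siblings)  # [' a --> x <!-- b ']
```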

To parse arbitrarily nested comments, you'll need a method that can parse context-free grammars. A simple way to do so is to use a pushdown automaton, that is, a state machine with a stack.
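
A rough sketch of that idea in Python is below. It assumes a hypothetical format in which comments are allowed to nest (real HTML comments do not nest), and the function name `extract_nested_comments` is made up for this example. The stack holds the start offset of every comment that is still open; a comment body is emitted only when the stack becomes empty again.

```python
def extract_nested_comments(text, opener='<!--', closer='-->'):
    """Return the body of each top-level comment, nested comments included."""
    stack = []    # start offsets of the comment bodies that are still open
    bodies = []
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            # Push: remember where this comment's body begins.
            stack.append(i + len(opener))
            i += len(opener)
        elif stack and text.startswith(closer, i):
            # Pop: this closer ends the innermost open comment.
            start = stack.pop()
            if not stack:
                # Back at depth 0, so a top-level comment just ended.
                bodies.append(text[start:i])
            i += len(closer)
        else:
            i += 1
    return bodies


nested = '<!-- how do you deal <!-- with nested --> comments? -->'
print(extract_nested_comments(nested))
# [' how do you deal <!-- with nested --> comments? ']
```

In this sketch a stray closer outside any comment is skipped and an unclosed comment is dropped silently; a stricter version might raise an error instead.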

Wilduck answered Oct 18 '22 01:10
