
How to search text surrounded by double-quotes with RegEx?

Tags:

regex

I have a string with some HTML code in, for example:

This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>

I need to strip out the id attribute from every HTML tag, but I have zero experience with regular expressions, so I searched around the internet and came up with this pattern: [\s]+id=\".*\"

Unfortunately, it's not working as I expected. I was hoping the regular expression would match id=" followed by any characters, repeated any number of times, and terminated by the nearest double quote. In this example I expected it to match id="c1-id-8" and id="c1-id-9". Instead, the pattern returned the substring id="c1-id-8">some</strong> <em id="c1-id-9": it finds the first occurrence of id=" and the last occurrence of a double quote character.

Could you tell me what is wrong in my pattern and how to fix it, please? Thank you very much

Cesco asked Sep 25 '11 13:09





2 Answers

The quantifier in .* is greedy (meaning it matches as much as it can). To match only the minimum required, you could use something like /\s+id=\"[^\"]*\"/. The brackets [] indicate a character class, which matches any single character listed inside the brackets. The caret ^ at the beginning of a character class is a negation, meaning the class matches any character except those listed in the brackets.

An alternative would be to make the quantifier lazy by changing .* to .*?, which will match as little as it can.
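
To see the difference side by side, here is a minimal sketch in Python (the question does not name a language, so Python's re module is an assumption, as are the variable names). It runs the original greedy pattern, the negated character class, and the lazy variant against the string from the question:

```python
import re

# Sample string taken from the question.
html = 'This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>'

# Greedy: .* runs on to the last double quote in the string.
greedy = re.findall(r'\s+id=".*"', html)

# Negated character class: [^"]* cannot cross a quote, so it stops at the nearest one.
negated = re.findall(r'\s+id="[^"]*"', html)

# Lazy: .*? expands only as far as needed to reach the next quote.
lazy = re.findall(r'\s+id=".*?"', html)

print(greedy)
print(negated)
print(lazy)
```

The greedy pattern returns a single long match, while the other two each return the two id attributes separately.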

nachito answered Nov 23 '22 20:11



In .* the asterisk is a greedy quantifier and matches as many characters as it can, so it only stops at the last " it can find on the line.

You can either use ".*?" to make it lazy, or (better IMO), use "[^"]*" to make the match explicit:

"      # match a quote
[^"]*  # match any number of characters except quotes
"      # match a quote

You might still need to escape the quotes if you're building the regex from a string; otherwise that's not necessary, since quotes are not special characters in a regex.
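
As a rough illustration of both points, here is a sketch in Python (an assumption, since the question doesn't say which language is in use; ID_ATTR, ESCAPED and strip_ids are made-up names). The first pattern uses verbose mode so it can carry the same comments as the breakdown above, with the \s+id= prefix from the question added. The second shows the same regex built from an ordinary string literal, where the quotes and the backslash do need escaping:

```python
import re

# Verbose mode: whitespace and "#" comments inside the pattern are ignored.
# A raw, triple-quoted string means the double quotes need no escaping.
ID_ATTR = re.compile(r'''
    \s+id=   # whitespace followed by the attribute name
    "        # match a quote
    [^"]*    # match any number of characters except quotes
    "        # match a quote
''', re.VERBOSE)

# Same regex in an ordinary double-quoted string literal: here the quotes
# (and the backslash in \s) must be escaped for the string, not for the regex.
ESCAPED = re.compile("\\s+id=\"[^\"]*\"")

def strip_ids(html: str) -> str:
    """Remove every double-quoted id attribute from the given markup."""
    return ID_ATTR.sub("", html)

print(strip_ids('This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>'))
```

This only covers attributes written with double quotes, as in the question's example.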

Tim Pietzcker answered Nov 23 '22 20:11
