I have a string with some HTML code in it, for example:
This is <strong id="c1-id-8">some</strong> <em id="c1-id-9">text</em>
I need to strip out the id attribute from every HTML tag, but I have zero experience with regular expressions, so I searched here and there on the internet and wrote this pattern: [\s]+id=\".*\"
Unfortunately it's not working as I would expect. In fact, I was hoping that the regular expression would match id=" followed by any character repeated any number of times and terminated by the nearest double quote. In this example I was expecting it to match id="c1-id-8" and id="c1-id-9".
But instead the pattern returned the substring id="c1-id-8">some</strong> <em id="c1-id-9": it finds the first occurrence of id=" and the last occurrence of a double quote character.
Could you tell me what is wrong with my pattern and how to fix it, please? Thank you very much.
Firstly, the double quote character is nothing special in regex - it's just another character, so it doesn't need escaping from the perspective of regex. However, because Java uses double quotes to delimit String constants, if you want to create a string in Java with a double quote in it, you must escape it.
To do that, put a backslash (\) in front of the double quote: \"
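As a rough illustration of the two escaping layers, here is a minimal Java sketch. The class name QuoteEscapingDemo and the helper containsIdAttribute are made up for this example. The regex engine never sees the backslash before the quote: it is consumed by the Java compiler. Only the doubled backslash in front of s reaches the regex engine, as \s.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the original post.
final class QuoteEscapingDemo {
    // Java source escapes: \" yields one quote character, \\ yields one backslash.
    // The regex engine therefore receives 13 characters: \s+id="[^"]*"
    static final String ID_ATTRIBUTE = "\\s+id=\"[^\"]*\"";

    // Returns true if the input contains an id="..." attribute preceded by whitespace.
    static boolean containsIdAttribute(String html) {
        return Pattern.compile(ID_ATTRIBUTE).matcher(html).find();
    }

    public static void main(String[] args) {
        // Prints the pattern exactly as the regex engine sees it.
        System.out.println(ID_ATTRIBUTE);
        System.out.println(containsIdAttribute("<em id=\"c1-id-9\">text</em>"));
    }
}
```

Printing the string is a quick way to check what the regex engine actually receives after Java has processed the escapes.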
The period (.) represents the wildcard character. Any character (except for the newline character) will be matched by a period in a regular expression; when you literally want a period in a regular expression you need to precede it with a backslash.
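A small Java sketch of the difference between the wildcard period and an escaped, literal period. The class and method names are hypothetical, and the patterns a.c and a\.c are chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the original post.
final class DotWildcardDemo {
    // "a.c": the period matches any single character except a line terminator.
    static boolean matchesWildcard(String input) {
        return Pattern.matches("a.c", input);
    }

    // "a\\.c" in Java source is the regex a\.c: the period must appear literally.
    static boolean matchesLiteralDot(String input) {
        return Pattern.matches("a\\.c", input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWildcard("abc"));    // wildcard accepts 'b'
        System.out.println(matchesLiteralDot("abc"));  // literal dot rejects 'b'
        System.out.println(matchesLiteralDot("a.c"));  // literal dot accepts '.'
    }
}
```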
Double quotes around a string are used to specify a regular expression search (as defined by the GNU regular expression library).
The quantifier .* in your regex is greedy (meaning it matches as much as it can). To match only the minimum required, you could use something like /\s+id=\"[^\"]*\"/. The brackets [] indicate a character class, which matches any single character listed inside the brackets. The caret ^ at the beginning of a character class is a negation, meaning the class matches any character except those specified in the brackets.
An alternative would be to make the .* quantifier lazy by changing it to .*?, which will match as little as it can.
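To compare the three variants on the string from the question, here is a minimal Java sketch. The class StripIdDemo and its helpers firstMatch and stripIds are hypothetical names introduced for this example. It assumes the attribute values never contain a double quote, which holds for the sample input but not for arbitrary HTML.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the original post.
final class StripIdDemo {
    static final String HTML =
        "This is <strong id=\"c1-id-8\">some</strong> <em id=\"c1-id-9\">text</em>";

    static final String GREEDY  = "\\s+id=\".*\"";     // runs to the last quote on the line
    static final String LAZY    = "\\s+id=\".*?\"";    // stops at the nearest quote
    static final String NEGATED = "\\s+id=\"[^\"]*\""; // cannot cross a quote at all

    // Returns the first substring matched by the regex, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    // Removes every id="..." attribute together with its leading whitespace.
    static String stripIds(String html) {
        return html.replaceAll(NEGATED, "");
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(GREEDY, HTML));
        System.out.println(firstMatch(LAZY, HTML));
        System.out.println(firstMatch(NEGATED, HTML));
        System.out.println(stripIds(HTML));
    }
}
```

On this input the lazy and negated-class patterns find the same match. The negated class is often preferred because it states directly what may appear between the quotes.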
In .* the asterisk is a greedy quantifier and matches as many characters as it can, so it only stops at the last " it finds.
You can either use ".*?" to make it lazy, or (better IMO) use "[^"]*" to make the match explicit:
" # match a quote
[^"]* # match any number of characters except quotes
" # match a quote
You might still need to escape the quotes if you're building the regex from a string; otherwise that's not necessary, since quotes are not special characters in a regex.
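The commented form above can be kept as-is in Java by compiling with Pattern.COMMENTS, which ignores whitespace in the pattern and treats # as the start of a comment that runs to the end of the line. The sketch below is only an illustration. The class name CommentedRegexDemo and the helper firstQuoted are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the original post.
final class CommentedRegexDemo {
    // Each line carries its own comment; the quotes are escaped only for Java.
    static final Pattern QUOTED = Pattern.compile(
        "\"      # match a quote\n"
      + "[^\"]*  # match any number of characters except quotes\n"
      + "\"      # match a quote\n",
        Pattern.COMMENTS);

    // Returns the first double-quoted substring (quotes included), or null if none.
    static String firstQuoted(String input) {
        Matcher m = QUOTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstQuoted("<strong id=\"c1-id-8\">some</strong> <em id=\"c1-id-9\">"));
    }
}
```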