I have a large batch of HTML text containing several CSS properties that resemble the following:
font:16px/normal Consolas;
font:16px/normal Arial;
font:12px/normal Courier;
which are bundled with several other CSS properties and associated HTML values and tags.
I've been trying to write a regular expression that will grab only these "font styles", so if I had the following markup:
<p style='font:16px/normal Arial; font-weight: x; color: y;'>Stack</p>
<span style='color: z; font:16px/normal Courier;'>Overflow</span>
<br />
<div style='font-family: Segoe UI; font-size: xx-large;'>Really large</div>
it would only match the properties beginning with font: and ending with a semicolon (;).
I've played around using RegexHero and the closest I have gotten was:
\b(?:font[\s*\\]*:[\s*\\]*?(\b.*\b);)
which yielded the following results:
font:bold; //Match
font:12pt/normal Arial; //Match
font:16px/normal Consolas; //Match
font:12pt/normal Arial; //Match
property: value; //Not a Match
property: value value value; //Not a Match
but when I attempted to drop in a large block of HTML, things seemed to get muddled and large blocks were selected rather than within the bounds previously specified.
I'll be glad to provide any additional info and test data that I can.
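A likely cause of the over-matching, sketched here in Python rather than RegexHero: the .* in the pattern is greedy, so on a line that holds several properties it runs to the last semicolon on that line instead of the first. The LAZY variant below is a hypothetical one-character change (.* to .*?) added only for comparison; it is not part of the original attempt.

```python
import re

# The pattern from the question, unchanged.
GREEDY = re.compile(r"\b(?:font[\s*\\]*:[\s*\\]*?(\b.*\b);)")
# Hypothetical variant: identical except that .* is made lazy (.*?).
LAZY = re.compile(r"\b(?:font[\s*\\]*:[\s*\\]*?(\b.*?\b);)")

line = "<p style='font:16px/normal Arial; font-weight: x; color: y;'>Stack</p>"

greedy_match = GREEDY.search(line).group(0)
lazy_match = LAZY.search(line).group(0)

print(greedy_match)  # font:16px/normal Arial; font-weight: x; color: y;
print(lazy_match)    # font:16px/normal Arial;
```

The test strings in the results above each hold one property per line, which would explain why the greedy pattern looked correct there.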
Use square brackets [] to match any character in a set. Use \w to match any single alphanumeric character: 0-9, a-z, A-Z, and _ (underscore). Use \d to match any single digit. Use \s to match any single whitespace character.
$ means "Match the end of the string" (the position after the last character in the string).
In a CSS attribute selector, the wildcard operator *= means "contains". For example, input[name*='sel'] selects the input tags whose 'name' attribute contains the text 'sel'.
To match a character that has a special meaning in regex, use an escape sequence prefixed with a backslash (\). E.g., \. matches "."; \+ matches "+"; and \( matches "(". You also need to use \\ to match "\" (backslash).
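The regex tokens described above can be tried out with a short Python sketch. The sample strings are made up for illustration. One caveat: with Python 3 string patterns, \w and \d also match non-ASCII letters and digits unless the re.ASCII flag is passed.

```python
import re

# [] matches any one character from the set.
vowels = re.findall(r"[aeiou]", "Consolas")

# \d matches a single digit, \w a single word character, \s a single whitespace character.
digits = re.findall(r"\d", "16px/normal")
words = re.findall(r"\w+", "font_size: 16px")
spaces = re.findall(r"\s", "a b\tc")

# $ anchors the match at the end of the string.
ends_with_semicolon = re.search(r";$", "font:bold;") is not None

# A backslash escapes characters that are otherwise special, including itself.
literal_dots = re.findall(r"\.", "1.5em")
literal_backslashes = re.findall(r"\\", r"a\b")

print(vowels, digits, words, spaces, ends_with_semicolon)
print(literal_dots, literal_backslashes)
```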
Try this:
\b((?:font:[^;]*?)(?:;|'))
Explanation:
\b            # Assert position at a word boundary
(             # Match the regular expression below and capture its match into backreference number 1
   (?:        # Match the regular expression below
      font:   # Match the characters “font:” literally
      [^;]    # Match any character that is NOT a “;”
         *?   # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
   )
   (?:        # Match the regular expression below
              # Match either the regular expression below (attempting the next alternative only if this one fails)
      ;       # Match the character “;” literally
     |        # Or match regular expression number 2 below (the entire group fails if this one fails to match)
      '       # Match the character “'” literally
   )
)
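As a quick check, here is a minimal sketch that runs this pattern over the sample markup from the question. It uses Python's re module rather than RegexHero, and the names FONT_SHORTHAND, html, and no_semicolon are made up for the example.

```python
import re

# The pattern from the answer, unchanged.
FONT_SHORTHAND = re.compile(r"\b((?:font:[^;]*?)(?:;|'))")

html = """<p style='font:16px/normal Arial; font-weight: x; color: y;'>Stack</p>
<span style='color: z; font:16px/normal Courier;'>Overflow</span>
<br />
<div style='font-family: Segoe UI; font-size: xx-large;'>Really large</div>"""

# findall returns the contents of capture group 1 for every match.
matches = FONT_SHORTHAND.findall(html)
print(matches)  # ['font:16px/normal Arial;', 'font:16px/normal Courier;']

# When the last property has no trailing semicolon, the closing quote ends the match.
no_semicolon = FONT_SHORTHAND.findall("<b style='color: z; font:12px/normal Courier'>x</b>")
print(no_semicolon)  # ["font:12px/normal Courier'"]
```

Note that font-family and font-size are skipped because the pattern requires the colon directly after "font". In the second case the captured text includes the closing quote, so it may need trimming before use.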