You may react by saying that parsing HTML with regex is a totally bad idea, citing this for example, and you are right.
But in my case, the following HTML node is created by our own server, so we know that it will always look like this. And since the regex will live in an Android mobile library, I don't want to pull in a library like Jsoup.
What I want to parse: <img src="myurl.jpg" width="12" height="32">
What should be parsed:
<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
(width|height)\s*=\s*['"]([^'"]*)['"]*
So the first regex has a group 1 with the img URL, and the second regex produces two matches whose subgroups hold the attribute name and its value.
How can I merge both?
Desired output:
To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use
"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"
See the regex demo and an IDEONE Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
// Same pattern as above, written as a Java string literal
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (!matcher.group(1).isEmpty()) { // We have a new IMG tag
        System.out.println("\n--- NEW MATCH ---");
    }
    // Group 2 is the attribute name, Group 4 is its value
    System.out.println(matcher.group(2) + ": " + matcher.group(4));
}
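If the values are needed per tag rather than printed, the same loop can collect them into one map per img tag. The following is only a minimal sketch, not part of the demo above: the class name ImgAttributes and the method parse are hypothetical names chosen for illustration, and it assumes the attribute values are quoted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced here for illustration only.
public class ImgAttributes {
    private static final Pattern ATTR = Pattern.compile(
        "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    // Returns one map (attribute name -> value) per img tag that carries
    // at least one of the src, width or height attributes.
    public static List<Map<String, String>> parse(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // A non-empty Group 1 means the match started at "<img",
            // so a new tag begins here.
            if (!m.group(1).isEmpty()) {
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            current.put(m.group(2), m.group(4));
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                 + "<link src=\"/test/test.css\"/>"
                 + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        System.out.println(parse(s));
    }
}
```

For the sample string this prints [{height=132, src=NEW_myurl.jpg, width=112}, {src=myurl.jpg, width=12, height=32}]; the link tag is skipped because it neither starts with <img nor continues a previous match.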
The regex details:
(<img\\b|(?!^)\\G) - the initial boundary matching the <img tag start or the end of the previous successful match
[^>]*? - match any optional attributes we are not interested in (0+ characters other than > so as to stay inside the tag)
\\b(src|width|height)= - a whole word src=, width= or height=
([\"']?) - a technical 3rd group to check the attribute value delimiter
([^>]*?) - Group 4 containing the attribute value (0+ characters other than a >, as few as possible, up to the first delimiter)
\\3 - attribute value delimiter matched with the Group 3 (NOTE: if a delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
The logic:
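Each match either starts at a new img tag or continues inside the current img tag, so every attribute comes back as a separate match rather than as a separate group. To tell where a new img tag begins, check whether the first group is not empty (if (!matcher.group(1).isEmpty())).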
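The NOTE in the regex details about an empty delimiter can be sketched as follows. This is an illustration under assumptions, not code from the answer: the class name UnquotedImgAttributes and the method extract are hypothetical, and the only change to the pattern is the (?=\\s|/?>) lookahead appended at the end, so that an unquoted value must run up to whitespace or the end of the tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, introduced here for illustration only.
public class UnquotedImgAttributes {
    // Same pattern as in the answer, plus the lookahead from the NOTE.
    private static final Pattern ATTR = Pattern.compile(
        "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    // Returns "name: value" strings for every matched attribute.
    public static List<String> extract(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            // Group 2 is the attribute name, Group 4 is its value.
            result.add(m.group(2) + ": " + m.group(4));
        }
        return result;
    }

    public static void main(String[] args) {
        // Mixed input: two unquoted values and one quoted value.
        for (String line : extract("<img src=myurl.jpg width=12 height=\"32\">")) {
            System.out.println(line);
        }
    }
}
```

Without the lookahead, an empty delimiter lets the lazy Group 4 stop immediately and capture an empty value; with it, the unquoted src and width values above are captured in full.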