
Regex <img > Tag parsing with src, width, height

You may react by saying that parsing HTML with a regex is a thoroughly bad idea, and you would be right.

But in my case, the following HTML node is generated by our own server, so we know it will always look like this. And since the regex will live in an Android library, I don't want to pull in a dependency such as Jsoup.

What I want to parse: <img src="myurl.jpg" width="12" height="32">

What should be parsed:

  • match a regular img tag, and group the src attribute value: <img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>
  • width and height attribute values: (width|height)\s*=\s*['"]([^'"]*)['"]*

So the first regex will have a #1 group with the img url, and the second regex will have two matches with subgroups of their values.
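
For reference, here is a minimal sketch of how the two patterns behave when applied one after the other to the sample tag. The class name `TwoRegexDemo` and the helper `describe` are hypothetical names used only for illustration, and the second pattern is written with Java string escaping so that it compiles.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoRegexDemo {
    // The two patterns from the question, as Java string literals.
    static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]+src\\s*=\\s*['\"]([^'\"]+)['\"][^>]*>");
    static final Pattern SIZE =
            Pattern.compile("(width|height)\\s*=\\s*['\"]([^'\"]*)['\"]*");

    /** Returns "src,width,height" for the first img tag, or null if none is found. */
    static String describe(String html) {
        Matcher img = IMG_SRC.matcher(html);
        if (!img.find()) {
            return null;
        }
        String width = "";
        String height = "";
        // Run the second pattern only over the text of the matched tag.
        Matcher size = SIZE.matcher(img.group());
        while (size.find()) {
            if (size.group(1).equals("width")) {
                width = size.group(2);
            } else {
                height = size.group(2);
            }
        }
        return img.group(1) + "," + width + "," + height;
    }

    public static void main(String[] args) {
        System.out.println(describe("<img src=\"myurl.jpg\" width=\"12\" height=\"32\">"));
    }
}
```

This needs two passes over each tag, which is what a single merged pattern would avoid.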

How can I merge both?

Desired output:

  • img url
  • width value
  • height value
Hugo Gresse asked May 02 '16 09:05



1 Answer

To match any img tag with src, height and width attributes that can come in any order and that are in fact optional, you can use

"(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3"

See the regex demo and an IDEONE Java demo:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\"><link src=\"/test/test.css\"/><img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
// Same pattern as above: each find() yields one src/width/height attribute.
Pattern pattern = Pattern.compile("(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    if (!matcher.group(1).isEmpty()) { // Group 1 is "<img": a new IMG tag starts here
        System.out.println("\n--- NEW MATCH ---");
    }
    System.out.println(matcher.group(2) + ": " + matcher.group(4)); // attribute name: value
}

The regex details:

  • (<img\\b|(?!^)\\G) - the initial boundary: either the start of an <img tag, or the end of the previous successful match (\\G); the (?!^) lookahead keeps \\G from matching at the very start of the string
  • [^>]*? - match any optional attributes we are not interested in (0+ characters other than > so as to stay inside the tag)
  • \\b(src|width|height)= - a whole word src=, width= or height= (the attribute name is captured in Group 2)
  • ([\"']?) - a technical 3rd group that captures the attribute value delimiter
  • ([^>]*?) - Group 4 containing the attribute value (0+ characters other than a >, as few as possible, up to the first occurrence of the delimiter captured in Group 3)
  • \\3 - a backreference to Group 3 that matches the closing attribute value delimiter (NOTE: if the delimiter may be empty, add (?=\\s|/?>) at the end of the pattern)
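
To illustrate the note about an empty delimiter, here is a small sketch that appends the suggested lookahead. It assumes unquoted values contain no whitespace. The class name `UnquotedAttrDemo` and the helper `attributes` are hypothetical names used only for this example. Without the lookahead, an unquoted value would be captured as an empty string, because the lazy Group 4 can stop immediately when Group 3 is empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnquotedAttrDemo {
    // The pattern from the answer plus the trailing lookahead (?=\s|/?>),
    // so that a value must be followed by whitespace or the end of the tag.
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3(?=\\s|/?>)");

    /** Returns "name=value" strings in the order in which they are matched. */
    static List<String> attributes(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            out.add(m.group(2) + "=" + m.group(4));
        }
        return out;
    }

    public static void main(String[] args) {
        // Mixed delimiters: none, single quotes, double quotes.
        System.out.println(attributes("<img src=myurl.jpg width='12' height=\"32\">"));
    }
}
```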

The logic:

  • Match the start of img tag
  • Then, match everything that is inside, but only capture the attributes we need
  • Since we are going to have multiple matches, not groups, we need to find a boundary for each new img tag. This is done by checking if the first group is not empty (if (!matcher.group(1).isEmpty()))
  • All that remains is to add a list to store the matches.
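
The last step could look roughly like the sketch below, which collects one map per img tag. The names `ImgTagCollector` and `parse` are hypothetical, and the choice of a `LinkedHashMap` (to keep attributes in document order) is an assumption rather than a requirement.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgTagCollector {
    // Same pattern as in the answer above.
    private static final Pattern ATTR = Pattern.compile(
            "(<img\\b|(?!^)\\G)[^>]*?\\b(src|width|height)=([\"']?)([^>]*?)\\3");

    /** Returns one map per img tag, keyed by attribute name (src, width, height). */
    static List<Map<String, String>> parse(String html) {
        List<Map<String, String>> tags = new ArrayList<>();
        Map<String, String> current = null;
        Matcher m = ATTR.matcher(html);
        while (m.find()) {
            if (!m.group(1).isEmpty()) { // Group 1 is "<img": a new tag starts here
                current = new LinkedHashMap<>();
                tags.add(current);
            }
            // The first match always starts with "<img", so current is set by now.
            current.put(m.group(2), m.group(4));
        }
        return tags;
    }

    public static void main(String[] args) {
        String s = "<img height=\"132\" src=\"NEW_myurl.jpg\" width=\"112\">"
                + "<link src=\"/test/test.css\"/>"
                + "<img src=\"myurl.jpg\" width=\"12\" height=\"32\">";
        for (Map<String, String> tag : parse(s)) {
            System.out.println(tag.get("src") + " " + tag.get("width") + " " + tag.get("height"));
        }
    }
}
```

Since the attributes are optional, `get` returns null for any that a tag does not have, and an img tag with none of the three attributes produces no entry at all.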
Wiktor Stribiżew answered Nov 14 '22 02:11
