
How can I parse <img src> with a regex?

Tags:

regex

I need a clever regex to match ... in these:

<img src="..."
<img src='...'
<img src=...

I want to match the inner content of src, but only if it is surrounded by matching double quotes, matching single quotes, or no quotes at all. This means that <img src=..." or <img src='... must not be accepted.

Any ideas on how to match these 3 cases with one regex?

So far I use something like ("|'|[\s\S])(.*?)\1, and the part I want to get rid of is the hacky [\s\S], which I use to match the "missing symbol" at the beginning and the end of the ....

Lachezar asked Dec 17 '22 20:12


1 Answer

Wow, second one I'm answering today.

Don't parse HTML with regex. Use an HTML/XML parser and your life will be much easier. Tidy will clean up your HTML code for you, so you can run the HTML through Tidy first and then through a parser. Some Tidy-based libraries parse in addition to sanitizing, so you may not even have to run the output through another parser.

Java, for example, has JTidy, and PHP has PHP Tidy.
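
To make the parser route concrete, here is a minimal sketch. It uses Python's standard-library html.parser rather than Tidy, purely as an illustration; the names ImgSrcCollector and img_sources are made up for this example. The parser strips the quotes itself, so all three quoting styles from the question come out the same way.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; surrounding quotes are already removed.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


def img_sources(html):
    """Return the src values of all <img> tags in the given HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


print(img_sources('<p><img src="a.png"><img src=\'b.png\'><img src=c.png></p>'))
```

Note that a lenient parser like this one will not reject malformed quoting the way the question asks; it simply does its best to recover a value.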

UPDATE

Against my better judgement, I'm giving you this:

/<img\s+src\s*=\s*(["'][^"']+["']|[^>]+)>/

This works only for your specific case. Even so, it does not account for escaped " or ' in your image-source names, or for a > character inside them. It also assumes that src is the first attribute and that the tag closes right after it, and it does not check that the opening and closing quotes match. The capturing group gives you your image names (for names surrounded by single or double quotes, it includes the quotes as well, but you can strip those out).
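
If the quotes really must be balanced, as the question asks, a backreference can enforce it. The following is only a sketch in Python, not a drop-in replacement for the pattern above: the names IMG_SRC and img_src are invented for this example, and it still assumes that src is the first attribute and that the value contains no escaped quotes.

```python
import re

IMG_SRC = re.compile(
    r"""<img\s+src\s*=\s*
        (?:
            (["'])(.*?)\1        # quoted: the closing quote must equal the opening one
          | ([^\s"'>]+)          # unquoted: no quotes or whitespace allowed inside
        )
        (?=[\s/>]|$)             # the value must be followed by whitespace, a slash, a closing bracket, or end of input
    """,
    re.VERBOSE | re.IGNORECASE,
)


def img_src(tag):
    """Return the src value, or None if the tag does not match (e.g. unbalanced quotes)."""
    match = IMG_SRC.search(tag)
    if match is None:
        return None
    # Group 2 holds a quoted value, group 3 an unquoted one.
    return match.group(2) if match.group(1) else match.group(3)


for sample in ('<img src="a.png">', "<img src='b.png'>", "<img src=c.png>",
               '<img src=c.png">', "<img src='d.png>"):
    print(sample, "->", img_src(sample))
```

The two alternatives replace the [\s\S] trick from the question: quoted and unquoted values are matched separately, so nothing has to stand in for a "missing" quote.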

Vivin Paliath answered Mar 27 '23 14:03