
Regex to find iframe tags and retrieve attributes

Tags: c#, .net, regex

I am trying to retrieve iframe tags and attributes from an HTML input.

Sample input

<div class="1"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/1" frameborder="0" allowfullscreen=""></iframe></div>
<div class="2"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/2" frameborder="0" allowfullscreen=""></iframe></div>

I have been trying to collect them using the following regex:

<iframe.+?width=[\"'](?<width>.*?)[\"']?height=[\"'](?<height>.*?)[\"']?src=[\"'](?<src>.*?)[\"'].+?>

This results in

[image: screenshot of the result, not available]

This is exactly the format I want.

The problem is, if the HTML attributes are in a different order this regex won't work.

Is there any way to modify this regex to ignore the attribute order and return the iframes grouped in Matches so that I could iterate through them?

Wesley Skeen asked Nov 30 '25 12:11


2 Answers

Here is a regex that will ignore the order of attributes:

(?<=<iframe[^>]*?)(?:\s*width=["'](?<width>[^"']+)["']|\s*height=["'](?<height>[^'"]+)["']|\s*src=["'](?<src>[^'"]+)["'])+[^>]*?>

RegexStorm demo

C# sample code:

// Requires: using System.Linq; using System.Text.RegularExpressions;
var rx = new Regex(@"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+)[""'])+[^>]*?>");
var input = @"YOUR INPUT STRING";
// Each Match exposes its captures as match.Groups["width"], ["height"] and ["src"].
var matches = rx.Matches(input).Cast<Match>().ToList();

Output:

[image: screenshot of the output, not available]
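
Since the question asks how to iterate through the matches, here is a rough sketch of reading the named groups from each one. The class and method names (IframeRegexDemo, Describe) are made up for illustration, and the pattern is written with the closing quote outside the src group so that the captured URL does not include it. As far as I can tell from tracing the pattern, it only picks up width, height and src when they sit next to each other in the tag; if another attribute such as frameborder comes between them, the later ones are not captured.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IframeRegexDemo
{
    // Order-independent pattern for adjacent width/height/src attributes.
    private static readonly Regex Rx = new Regex(
        @"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+)[""'])+[^>]*?>");

    // Hypothetical helper: returns one summary line per matched iframe.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        foreach (Match m in Rx.Matches(input))
        {
            // A named group that did not take part in the match has an empty Value.
            string width = m.Groups["width"].Value;
            string height = m.Groups["height"].Value;
            string src = m.Groups["src"].Value;
            lines.Add("width=" + width + "; height=" + height + "; src=" + src);
        }
        return lines;
    }
}
```

Calling IframeRegexDemo.Describe(input) on the sample input should give one line per iframe. Check m.Groups["width"].Success first if you need to tell a missing attribute apart from an empty one.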

Wiktor Stribiżew answered Dec 02 '25 01:12


Regular expressions match patterns, and the structure of your string determines which pattern you need. So if you want to use regular expressions, the order of the attributes matters.

You can deal with this in 2 ways:

  1. The good and recommended way is to not parse HTML with regular expressions (mandatory link), but rather use a parsing framework such as the HTML Agility Pack. This should allow you to process the HTML you need and extract any values you are after.

  2. The second, bad, and not recommended way is to break your matching into two parts. First use something like <iframe(.+?)></iframe> to extract the entire iframe declaration, and then use multiple, smaller regular expressions to find the attributes you are after. The above regex obviously fails if your iframe is written as a self-closing tag: <iframe.../>. This should give you a hint as to why you should not parse HTML with regular expressions.
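
For completeness, the two-step idea from option 2 could be sketched roughly like this. It is only an illustration, not a recommendation: the class name IframeExtractor, both patterns, and the dictionary return type are my own assumptions. The first pattern matches only the opening tag, so it also covers <iframe.../>, but it still breaks on things like an attribute value that contains a > character.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IframeExtractor
{
    // Step 1: grab everything between "<iframe" and the ">" that ends the opening tag.
    private static readonly Regex TagRx =
        new Regex(@"<iframe\b([^>]*)>", RegexOptions.IgnoreCase);

    // Step 2: pick out name="value" or name='value' pairs, in whatever order they appear.
    private static readonly Regex AttrRx =
        new Regex(@"([\w-]+)\s*=\s*(?:""([^""]*)""|'([^']*)')");

    // Returns one attribute dictionary per iframe found in the input.
    public static List<Dictionary<string, string>> Extract(string html)
    {
        var result = new List<Dictionary<string, string>>();
        foreach (Match tag in TagRx.Matches(html))
        {
            var attrs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
            foreach (Match attr in AttrRx.Matches(tag.Groups[1].Value))
            {
                // Only one of groups 2 and 3 takes part, depending on the quote style.
                string value = attr.Groups[2].Success
                    ? attr.Groups[2].Value
                    : attr.Groups[3].Value;
                attrs[attr.Groups[1].Value] = value;
            }
            result.Add(attrs);
        }
        return result;
    }
}
```

Because the attributes end up in a dictionary, the caller can look up "width", "height" or "src" without caring where they appeared in the tag.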

As stated, you should go with the first option.
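
A minimal sketch of that first option, assuming the third-party HtmlAgilityPack NuGet package is referenced in the project. The class name IframeReader and the tuple shape are made up for illustration; HtmlDocument.LoadHtml, SelectNodes and GetAttributeValue are the library calls it relies on.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party NuGet package, not part of the .NET base library

public static class IframeReader
{
    // Hypothetical helper: returns width, height and src for every iframe in the HTML.
    public static List<(string Width, string Height, string Src)> Read(string html)
    {
        var result = new List<(string Width, string Height, string Src)>();

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//iframe");
        if (nodes == null)
        {
            return result;
        }

        foreach (var node in nodes)
        {
            // The second argument is the fallback used when the attribute is absent.
            result.Add((
                node.GetAttributeValue("width", ""),
                node.GetAttributeValue("height", ""),
                node.GetAttributeValue("src", "")));
        }
        return result;
    }
}
```

Attribute order, quote style and self-closing tags are handled by the parser here, so none of the regex caveats above apply.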

npinti answered Dec 02 '25 02:12


