
Iterate through an html string to find all img tags and replace the src attribute values

I have HTML code as a string. I need to find all img tags in that string, read the value of each src attribute, and pass it to a function. That function returns an entire img tag, which must replace the img tag that was read.

It needs to iterate through the whole string and execute the same logic for all img tags.

For example, suppose that my html string looks like this:

string htmlBody = "<p>Hi everyone</p><img src=\"...\" /><p>I am here</p><img src=\"...\" />";

I have the following code, which finds the first img tag, takes the src value (a base64 string), and converts it into a byte array to create a stream. I can then create a new src value that links to that stream.

  //Remove from all src attributes "data:image/png;base64"      
  string res = Regex.Replace(htmlBody, "data:image\\/\\w+\\;base64\\,", "");
  //Match the img tag and get the base64  string value
  string matchString = Regex.Match(res, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
  var imageData = Convert.FromBase64String(matchString);
  var contentId = Guid.NewGuid().ToString();
  LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
  inline.ContentId = contentId;
  inline.TransferEncoding = TransferEncoding.Base64;
  //Replace all img tags with the new img tag 
  htmlBody = Regex.Replace(htmlBody, "<img.+?src=[\"'](.+?)[\"'].*?>", @"<img src='cid:" + inline.ContentId + @"'/>");

As you can see, I finally get the new img tag to use as the replacement:

   <img src='cid:" + inline.ContentId + @"'/>

But the code replaces every img tag with the same content. I need to get each img tag, run the logic, replace it, and then continue with the next img tag.

I hope you can give me an idea of how to do that. Thanks in advance.

D.B asked Sep 30 '16 07:09


2 Answers

If I understand your need correctly, you can use HtmlAgilityPack for this. Using regex on HTML may cause unwanted behavior. Can you try the code below?

public static string DoIt()
{
        string htmlString = "";
        using (WebClient client = new WebClient())
            htmlString = client.DownloadString("http://dean.edwards.name/my/base64-ie.html"); //This is an example source for base64 img src, you can change this directly to your source.

        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(htmlString);
        document.DocumentNode.Descendants("img")
                            .Where(e =>
                            {
                                string src = e.GetAttributeValue("src", null) ?? "";
                                return !string.IsNullOrEmpty(src) && src.StartsWith("data:image");
                            })
                            .ToList()
                            .ForEach(x =>
                            {
                                string currentSrcValue = x.GetAttributeValue("src", null);
                                currentSrcValue = currentSrcValue.Split(',')[1];//Base64 part of string
                                byte[] imageData = Convert.FromBase64String(currentSrcValue);
                                string contentId = Guid.NewGuid().ToString();
                                LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
                                inline.ContentId = contentId;
                                inline.TransferEncoding = TransferEncoding.Base64;

                                x.SetAttributeValue("src", "cid:" + inline.ContentId);
                            });


        string result = document.DocumentNode.OuterHtml;
        return result; // the method is declared to return string, so return the rewritten HTML
}

You can retrieve HtmlAgilityPack from https://www.nuget.org/packages/HtmlAgilityPack

Hope this helps

Cihan Uygun answered Oct 10 '22 15:10


I think you need to run your code for each img fetched from the string. The following code gives you a list of the src values of all the img tags:

public static List<string> FetchImgsFromSource(string htmlSource)
        {
            List<string> listOfImgdata = new List<string>();
            string regexImgSrc = @"<img[^>]*?src\s*=\s*[""']?([^'"" >]+?)[ '""][^>]*?>";
            MatchCollection matchesImgSrc = Regex.Matches(htmlSource, regexImgSrc, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            foreach (Match m in matchesImgSrc)
            {
                string href = m.Groups[1].Value;
                listOfImgdata.Add(href);
            }
            return listOfImgdata;
        }

Use this list and your logic in a loop:

foreach (var item in FetchImgsFromSource(htmlBody))
            {
                //Skip src values that are not base64 data URIs
                if (!item.StartsWith("data:image")) continue;
                //Decode only the part after the "data:image/...;base64," prefix
                var imageData = Convert.FromBase64String(item.Substring(item.IndexOf(',') + 1));
                var contentId = Guid.NewGuid().ToString();
                LinkedResource inline = new LinkedResource(new MemoryStream(imageData), "image/jpeg");
                inline.ContentId = contentId;
                inline.TransferEncoding = TransferEncoding.Base64;
                //Replace only this image's src value with its own content id
                htmlBody = htmlBody.Replace(item, "cid:" + inline.ContentId);
            }

Hope it works for you.

Also, the best way to parse the HTML DOM is to use HtmlAgilityPack, as mentioned by others.
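
If you want to stay with regex, one possible way to run the logic once per img tag is the Regex.Replace overload that takes a MatchEvaluator: the delegate is called for each match and its return value replaces only that match. The following is just a sketch under some assumptions. The class name InlineImageRewriter, the method name Rewrite, the resources list, and the pattern itself are made up for illustration and are not from the question. It also assumes every src to convert is a base64 data URI, and, like the code in the question, it drops any other attributes on the tag.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Mail;
using System.Net.Mime;
using System.Text.RegularExpressions;

// Hypothetical helper, named here only for illustration.
public static class InlineImageRewriter
{
    // Matches an img tag whose src is a base64 data URI.
    // Group "mime" holds the media type, group "data" the base64 payload.
    private static readonly Regex DataUriImg = new Regex(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*[\"']data:(?<mime>image/[\\w.+-]+);base64,(?<data>[^\"']+)[\"'][^>]*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the rewritten HTML and adds one LinkedResource per replaced tag
    // to the caller-supplied list (an assumption about how you want to keep them).
    public static string Rewrite(string html, List<LinkedResource> resources)
    {
        // The evaluator runs once per match, so each tag gets its own content id.
        return DataUriImg.Replace(html, match =>
        {
            byte[] imageData = Convert.FromBase64String(match.Groups["data"].Value);
            var inline = new LinkedResource(new MemoryStream(imageData), match.Groups["mime"].Value)
            {
                ContentId = Guid.NewGuid().ToString(),
                TransferEncoding = TransferEncoding.Base64
            };
            resources.Add(inline);
            // Only the text of this match is replaced by the returned string.
            return "<img src='cid:" + inline.ContentId + "'/>";
        });
    }
}
```

You would call it as InlineImageRewriter.Rewrite(htmlBody, resources) with an empty List<LinkedResource>, and afterwards the list holds one resource per converted image, in document order. Tags whose src is not a data URI do not match the pattern and are left untouched.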

Pabdev answered Oct 10 '22 15:10