
Regex - Find Content of div by id with nested divs

Tags: regex

Before anybody asks, I am not doing any kind of screenscraping.

I'm trying to parse an HTML string to find a div with a certain id, and I cannot for the life of me get this to work. The following expression worked in one instance but not in another. I'm not sure whether it has to do with extra elements in the HTML.

<div\s*?id=(\""|&quot;|&#34;)content(\""|&quot;|&#34;).*?>\s*?(?>(?! <div\s*?> | </div> ) | <div\s*?>(?<DEPTH>) | </div>(?<-DEPTH>) | .?)*(?(DEPTH)(?!))</div>

It finds the first div with the right id correctly, but then it stops at the first closing div instead of the closing div that matches it.

<div id="firstdiv">begining content<div id="content">some other stuff
    <div id="otherdiv">other stuff here</div>
    more stuff
    </div>
</div>

This should bring back

<div id="content">some other stuff
   <div id="otherdiv">other stuff here</div>
   more stuff
</div>

but for some reason it does not. Instead, it brings back:

   <div id="content">some other stuff
      <div id="otherdiv">other stuff here</div>

Does anybody have an easier expression to handle this?

To clarify, this is in .NET, and I'm using a balancing group named DEPTH. You can find more details here.

ncyankee asked Nov 13 '08 02:11


3 Answers

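In .NET you can do this with balancing groups:

(?<text>
  <div\s+id=(\"|&quot;|&\#34;)content(\"|&quot;|&\#34;)[^>]*>
  (?>
      <div\b[^>]*> (?<depth>)
    |
      </div> (?<-depth>)
    |
      (?! </?div\b ) .
  )*
  (?(depth)(?!))
  </div>
)

Each nested opening <div ...> pushes onto the depth stack, each nested </div> pops it, and (?(depth)(?!)) rejects the match if any div is still open. Written with this layout, the pattern needs the IgnorePatternWhitespace option (which is why the # is escaped). In either form you must use the Singleline option so that . also matches newlines. Here is an example using the console, with the pattern written compactly:

using System;
using System.Text.RegularExpressions;

namespace Temp
{
    class Program
    {
        static void Main()
        {
            string s = @"
<div id=""firstdiv"">begining content<div id=""content"">some other stuff
  <div id=""otherdiv"">other stuff here</div>
  more stuff
  </div>
</div>";
            // A nested <div ...> pushes onto "depth", a nested </div> pops it,
            // and any other character is consumed one at a time.
            Regex r = new Regex(@"(?<text><div\s+id=(\""|&quot;|&\#34;)"
                + @"content(\""|&quot;|&\#34;)[^>]*>"
                + @"(?><div\b[^>]*>(?<depth>)|</div>(?<-depth>)|(?!</?div\b).)*"
                + @"(?(depth)(?!))</div>)",
                RegexOptions.Singleline);
            Console.WriteLine("HTML:\n");
            Console.WriteLine(s);
            Match m = r.Match(s);
            if (m.Success)
            {
                Console.WriteLine("\nCaptured text:\n");
                // Refer to the group by name rather than by number.
                Console.WriteLine(m.Groups["text"].Value);
            }
            Console.ReadLine();
        }
    }
}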
pro3carp3 answered Nov 13 '22 08:11


Are you asking for a regular expression that can keep track of the number of DIV tags nested inside a DIV tag? I'm afraid that isn't possible with classic regular expressions.

You could use a regular expression to get the index of the first DIV tag, then loop over the string starting at that index, keeping a count of the open div tags: add one for each opening tag and subtract one for each closing tag. When you encounter a closing div tag while the count is zero, you have the starting and ending indices of the substring you want.
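
The counting approach described above could be sketched roughly as follows. This is only an illustration, not code from the original answer: the class and method names (DivExtractor, ExtractDivById) are made up, and instead of walking character by character it steps from one div tag to the next with a second simple regex. It assumes the id attribute is quoted with ", &quot; or &#34;, as in the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class DivExtractor
{
    // Hypothetical helper: returns the outer HTML of the first <div> whose id
    // equals `id`, or null if it is not found or is never closed.
    public static string ExtractDivById(string html, string id)
    {
        // Step 1: a plain regex locates the opening tag of the target div.
        Match open = Regex.Match(
            html,
            @"<div\b[^>]*\bid\s*=\s*(?:""|&quot;|&#34;)" + Regex.Escape(id)
                + @"(?:""|&quot;|&#34;)[^>]*>",
            RegexOptions.IgnoreCase);
        if (!open.Success) return null;

        // Step 2: visit every later div tag and track the nesting depth.
        Regex tag = new Regex(@"<(/?)div\b[^>]*>", RegexOptions.IgnoreCase);
        int depth = 0; // nested divs currently open inside the target div
        for (Match m = tag.Match(html, open.Index + open.Length);
             m.Success;
             m = m.NextMatch())
        {
            bool isClosingTag = m.Groups[1].Length > 0;
            if (!isClosingTag)
            {
                depth++;          // a nested div opened
                continue;
            }
            if (depth == 0)
            {
                // This </div> closes the target div itself.
                int end = m.Index + m.Length;
                return html.Substring(open.Index, end - open.Index);
            }
            depth--;              // a nested div closed
        }
        return null; // input ended before the target div was closed
    }
}
```

Calling DivExtractor.ExtractDivById(html, "content") on the sample from the question should return the content div including its nested otherdiv. Note that a sketch like this is still fooled by div tags inside comments or script blocks.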

Cybis answered Nov 13 '22 08:11


Cybis speaks the truth. This sort of problem falls under context-free languages, which are more powerful than regular languages (the kind of thing covered by regular expressions). There's a lot of computer science theory involved, but suffice it to say that any language worth its salt will have a library written for this sort of thing, and you should probably be using it.
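
As one possible illustration of the library route in .NET, here is a minimal sketch using System.Xml.Linq from the base class library. It rests on a big assumption that is not in the question: the markup must be well-formed XML with a single root element, which real-world HTML often is not (unclosed tags or entities such as &nbsp; will make XDocument.Parse throw). The names XmlDivFinder and FindDivById are invented for the example; for messy HTML a dedicated HTML parser would be the better choice.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class XmlDivFinder
{
    // Hypothetical helper: assumes `html` is well-formed XML.
    // Returns the markup of the first <div> with the given id, or null.
    public static string FindDivById(string html, string id)
    {
        // PreserveWhitespace keeps whitespace-only text between tags.
        XDocument doc = XDocument.Parse(html, LoadOptions.PreserveWhitespace);

        // The parser tracks nesting, so no depth counting is needed here.
        XElement div = doc.Descendants("div")
            .FirstOrDefault(e => (string)e.Attribute("id") == id);

        // DisableFormatting avoids re-indenting the serialized element.
        return div == null ? null : div.ToString(SaveOptions.DisableFormatting);
    }
}
```

The serialized result may differ from the input in small ways, such as line endings or attribute quoting, because the element is written back out by an XML writer rather than sliced from the original string.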

Dan Fego answered Nov 13 '22 06:11