Using regex to get text between multiple HTML tags [duplicate]

Tags: html, c#, regex

Using regex, I want to be able to get the text between multiple DIV tags. For instance, the following:

<div>first html tag</div>
<div>another tag</div>

Would output:

first html tag
another tag

The regex pattern I am using only matches my last div tag and misses the first one. Code:

    static void Main(string[] args)
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string pattern = "(<div.*>)(.*)(<\\/div>)";

        MatchCollection matches = Regex.Matches(input, pattern);
        Console.WriteLine("Matches found: {0}", matches.Count);

        if (matches.Count > 0)
            foreach (Match m in matches)
                Console.WriteLine("Inner DIV: {0}", m.Groups[2]);

        Console.ReadLine();
    }

Output:

Matches found: 1

Inner DIV: This is ANOTHER test

Ben asked Apr 14 '13 23:04


7 Answers

The other answers don't mention HTML tags with attributes, so here is a solution that handles them:

// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World

Mehdi Dehghani answered Oct 04 '22 03:10


Replace your pattern with a non-greedy match:

static void Main(string[] args)
{
    string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
    string pattern = "<div.*?>(.*?)<\\/div>";

    MatchCollection matches = Regex.Matches(input, pattern);
    Console.WriteLine("Matches found: {0}", matches.Count);

    if (matches.Count > 0)
        foreach (Match m in matches)
            Console.WriteLine("Inner DIV: {0}", m.Groups[1]);

    Console.ReadLine();
}
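
To make the difference concrete, here is a minimal sketch that runs a greedy and a non-greedy version of the pattern against the input from the question. The class name GreedyVsLazyDemo and the helper CaptureAll are made up for illustration; only Regex.Matches and Match.Groups come from .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the text captured by group 1 for every match.
    public static List<string> CaptureAll(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Groups[1].Value);
        return results;
    }

    public static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: the first .* consumes as much as it can, so the opening tag
        // swallows the entire first div and only one match is reported.
        List<string> greedy = CaptureAll(input, "<div.*>(.*)</div>");

        // Lazy: each .*? stops at the first point where the rest of the pattern fits.
        List<string> lazy = CaptureAll(input, "<div.*?>(.*?)</div>");

        Console.WriteLine("Greedy matches: {0}", greedy.Count); // 1
        Console.WriteLine("Lazy matches: {0}", lazy.Count);     // 2
        foreach (string text in lazy)
            Console.WriteLine("Inner DIV: {0}", text);
    }
}
```

With the greedy pattern the single capture is "This is ANOTHER test", which is the output reported in the question.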

coolmine answered Oct 04 '22 03:10


I think this code should work:

string htmlSource = "<div>first html tag</div><div>another tag</div>";
// [^>]*? skips any attributes; Singleline lets '.' match newlines inside a div.
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
List<string> innerTexts = new List<string>(); // needs System.Collections.Generic
foreach (Match match in matches)
{
    innerTexts.Add(match.Groups[1].Value);
}

Tri Nguyen Dung answered Oct 04 '22 02:10


First of all, remember that a real HTML file contains newline characters ("\n"), which are missing from the string you are using to test your regex.
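
As a rough illustration of why newlines matter: by default '.' does not match "\n" in .NET, so a div whose content spans several lines is skipped unless RegexOptions.Singleline is set. The class NewlineDemo, the helper CountDivs, and the sample input are assumptions made for this sketch.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineDemo
{
    // Hypothetical helper: counts the div elements the pattern finds under the given options.
    public static int CountDivs(string input, RegexOptions options)
    {
        return Regex.Matches(input, "<div[^>]*>(.*?)</div>", options).Count;
    }

    public static void Main()
    {
        // The content of the second div spans two lines.
        string input = "<div>one line</div>\n<div>two\nlines</div>";

        // Without Singleline, '.' stops at the newline and the second div is missed.
        Console.WriteLine(CountDivs(input, RegexOptions.None));       // 1

        // With Singleline, '.' also matches '\n', so both divs are found.
        Console.WriteLine(CountDivs(input, RegexOptions.Singleline)); // 2
    }
}
```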

Second, taking your regex:

((<div.*>)(.*)(<\\/div>))+ // This regex looks for any number of div tags, but requires at least one.

((<div.*>)(.*)(<\\/div>))* // This regex looks for any number of div tags, and also matches when there are none.

Also a good place to look for this sort of information:

http://www.regular-expressions.info/reference.html

http://www.regular-expressions.info/refadv.html

Mayman answered Oct 04 '22 02:10


The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression will fail to extract the information you want.

The reason is that HTML is not a regular language: describing its nested structure takes at least a context-free grammar, which is a strictly more powerful class than regular expressions.

Here's an example: what if you have nested divs?

<div><div>stuff</div><div>stuff2</div></div>

Depending on how they are written, the regexes listed in the other answers can end up grabbing any of these:

<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>

because that's what regular expressions do when they try to parse HTML.

You can't write a regular expression that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific constrained set of HTML, it may be possible, but you should keep this fact in mind.
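
To see the problem in practice, here is a small sketch that applies the non-greedy pattern from the other answers to the nested example. The class NestedDivDemo and the helper CaptureAll are hypothetical names used only for this demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class NestedDivDemo
{
    // Hypothetical helper: returns the text captured by group 1 for every match.
    public static List<string> CaptureAll(string input, string pattern)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
            results.Add(m.Groups[1].Value);
        return results;
    }

    public static void Main()
    {
        string nested = "<div><div>stuff</div><div>stuff2</div></div>";

        // The pattern pairs each opening tag with the NEAREST closing tag,
        // because it has no notion of nesting depth.
        foreach (string captured in CaptureAll(nested, "<div.*?>(.*?)</div>"))
            Console.WriteLine("Captured: {0}", captured);

        // Prints:
        // Captured: <div>stuff
        // Captured: stuff2
    }
}
```

The outer div is paired with the first inner closing tag, so its capture contains a stray opening tag and the real content of the outer div is never returned.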

More information: https://stackoverflow.com/a/1732454/2022565

Tom Jacques answered Oct 04 '22 04:10


Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?

CsQuery also looks pretty useful (basically use CSS selector-style syntax to get the elements). See https://stackoverflow.com/a/11090816/618649.

CsQuery is basically meant to be "jQuery for C#," which is pretty much the exact search criteria I used to find it.

If you could do this in a web browser, you could easily use jQuery, using syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); }); (only you would obviously output the result to the log, or the screen, or make a web service call with it, or whatever you need to do with it).
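
For comparison, here is a sketch of the parser-based approach using only the standard library. Note the assumption: it uses System.Xml.Linq as a stand-in, not Html Agility Pack or CsQuery, and it only works when the markup happens to be well-formed XML. Real-world HTML usually is not, which is where a dedicated HTML parser earns its keep. The class ParserDemo, the method DivTexts, and the wrapper element name "root" are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ParserDemo
{
    // Hypothetical helper: returns the combined text of every <div>, in document order.
    public static List<string> DivTexts(string markup)
    {
        // Wrap the fragment so that several top-level divs still form one XML document.
        XElement root = XElement.Parse("<root>" + markup + "</root>");

        // XElement.Value concatenates all descendant text, similar to jQuery's .text().
        return root.Descendants("div").Select(div => div.Value).ToList();
    }

    public static void Main()
    {
        string flat = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
        string nested = "<div><div>stuff</div><div>stuff2</div></div>";

        foreach (string text in DivTexts(flat))
            Console.WriteLine("Inner DIV: {0}", text);

        // Nesting is handled because the parser tracks the element tree.
        foreach (string text in DivTexts(nested))
            Console.WriteLine("Inner DIV: {0}", text);
    }
}
```

For the nested input this yields "stuffstuff2" for the outer div, followed by "stuff" and "stuff2" for the inner ones.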

Craig Tullis answered Oct 04 '22 04:10


The following regex should work:

<div.*?>(.*?)</div>

It gives your desired output:

This is a test
This is ANOTHER test

Partha Mondal answered Oct 04 '22 03:10