Using regex, I want to be able to get the text between multiple DIV tags. For instance, the following:
<div>first html tag</div>
<div>another tag</div>
Would output:
first html tag
another tag
The regex pattern I am using only matches my last div tag and misses the first one. Code:
static void Main(string[] args)
{
string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
string pattern = "(<div.*>)(.*)(<\\/div>)";
MatchCollection matches = Regex.Matches(input, pattern);
Console.WriteLine("Matches found: {0}", matches.Count);
if (matches.Count > 0)
foreach (Match m in matches)
Console.WriteLine("Inner DIV: {0}", m.Groups[2]);
Console.ReadLine();
}
Output:
Matches found: 1
Inner DIV: This is ANOTHER test
Since the other answers don't mention HTML tags with attributes, here is my solution to deal with that:
// <TAG(.*?)>(.*?)</TAG>
// Example
var regex = new System.Text.RegularExpressions.Regex("<h1(.*?)>(.*?)</h1>");
var m = regex.Match("Hello <h1 style='color: red;'>World</h1> !!");
Console.Write(m.Groups[2].Value); // will print -> World
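Replace your pattern with a non-greedy match. The greedy .* in <div.*> consumes everything up to the last opening tag, which is why only the last div is captured: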
static void Main(string[] args)
{
string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";
string pattern = "<div.*?>(.*?)<\\/div>";
MatchCollection matches = Regex.Matches(input, pattern);
Console.WriteLine("Matches found: {0}", matches.Count);
if (matches.Count > 0)
foreach (Match m in matches)
Console.WriteLine("Inner DIV: {0}", m.Groups[1]);
Console.ReadLine();
}
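To make the difference concrete, here is a minimal sketch that runs the greedy pattern from the question and the non-greedy pattern from this answer against the same input. The class name DivTextDemo and the helper InnerTexts are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class DivTextDemo
{
    // Hypothetical helper: returns the chosen capture group of every match.
    public static string[] InnerTexts(string input, string pattern, int group)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[group].Value)
                    .ToArray();
    }

    public static void Main()
    {
        string input = "<div>This is a test</div><div class=\"something\">This is ANOTHER test</div>";

        // Greedy: the ".*" inside "<div.*>" runs on to the last opening tag,
        // so the whole input collapses into a single match.
        string[] greedy = InnerTexts(input, "(<div.*>)(.*)(</div>)", 2);

        // Non-greedy: each ".*?" stops at the first place the rest of the pattern can match.
        string[] lazy = InnerTexts(input, "<div.*?>(.*?)</div>", 1);

        Console.WriteLine("Greedy: {0} match(es): {1}", greedy.Length, string.Join(" | ", greedy));
        Console.WriteLine("Lazy: {0} match(es): {1}", lazy.Length, string.Join(" | ", lazy));
        // Greedy: 1 match(es): This is ANOTHER test
        // Lazy: 2 match(es): This is a test | This is ANOTHER test
    }
}
```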
I think this code should work:
string htmlSource = "<div>first html tag</div><div>another tag</div>";
string pattern = @"<div[^>]*?>(.*?)</div>";
MatchCollection matches = Regex.Matches(htmlSource, pattern, RegexOptions.IgnoreCase | RegexOptions.Singleline);
ArrayList l = new ArrayList();
foreach (Match match in matches)
{
l.Add(match.Groups[1].Value);
}
First of all, remember that a real HTML file will contain newline characters ("\n"), which are not present in the string you are using to test your regex.
Second, taking your regex:
((<div.*>)(.*)(<\\/div>))+ // Matches one or more consecutive div elements; at least one must be present.
((<div.*>)(.*)(<\\/div>))* // Matches zero or more div elements, so it also succeeds when there are none.
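On the newline point: in .NET, "." does not match "\n" unless RegexOptions.Singleline is set, so a div whose text spans several lines is silently skipped. The sketch below illustrates this with a made-up two-line input. The class name NewlineDemo and the helper CountDivs are hypothetical, and the pattern is the non-greedy one from the earlier answer rather than the patterns shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NewlineDemo
{
    // Hypothetical helper: counts div matches under the given regex options.
    public static int CountDivs(string html, RegexOptions options)
    {
        return Regex.Matches(html, "<div.*?>(.*?)</div>", options).Count;
    }

    public static void Main()
    {
        // Assumed sample input: the first div contains a line break.
        string html = "<div>first\nhtml tag</div>\n<div>another tag</div>";

        // Without Singleline, "." stops at "\n", so the first div is missed.
        Console.WriteLine(CountDivs(html, RegexOptions.None));       // 1

        // With Singleline, "." also matches "\n" and both divs are found.
        Console.WriteLine(CountDivs(html, RegexOptions.Singleline)); // 2
    }
}
```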
Also a good place to look for this sort of information:
http://www.regular-expressions.info/reference.html
http://www.regular-expressions.info/refadv.html
Mayman
The short version is that you cannot do this correctly in all situations. There will always be cases of valid HTML for which a regular expression will fail to extract the information you want.
The reason is that HTML has nested structure, which needs at least a context-free grammar to describe, and that is a more powerful class than what a plain regular expression can describe.
Here's an example -- what if you have multiple nested divs?
<div><div>stuff</div><div>stuff2</div></div>
Depending on how greedy they are, the regexes given in the other answers can end up grabbing any of:
<div><div>stuff</div>
<div>stuff</div>
<div>stuff</div><div>stuff2</div>
<div>stuff</div><div>stuff2</div></div>
<div>stuff2</div>
<div>stuff2</div></div>
because that's what regular expressions do when they try to parse HTML.
You can't write a regular expression that understands how to interpret all of the cases, because regular expressions are incapable of doing so. If you are dealing with a very specific constrained set of HTML, it may be possible, but you should keep this fact in mind.
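The nested case is easy to check. The sketch below runs the non-greedy pattern from the other answers against the nested example; the class name NestedDemo and the helper InnerTexts are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NestedDemo
{
    // Hypothetical helper: returns capture group 1 of every match.
    public static string[] InnerTexts(string html)
    {
        return Regex.Matches(html, "<div.*?>(.*?)</div>")
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        string html = "<div><div>stuff</div><div>stuff2</div></div>";

        foreach (string text in InnerTexts(html))
            Console.WriteLine(text);
        // <div>stuff
        // stuff2
    }
}
```

The first match pairs the outer opening tag with the first closing tag, so the captured "text" still contains an opening tag, and the outer div is never seen as a unit.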
More information: https://stackoverflow.com/a/1732454/2022565
Have you looked at the Html Agility Pack (see https://stackoverflow.com/a/857926/618649)?
CsQuery also looks pretty useful (basically use CSS selector-style syntax to get the elements). See https://stackoverflow.com/a/11090816/618649.
CsQuery is basically meant to be "jQuery for C#," which is pretty much the exact search criteria I used to find it.
If you could do this in a web browser, you could easily use jQuery, using syntax similar to $("div").each(function(idx){ alert(idx + ": " + $(this).text()); });
(only you would obviously output the result to the log, or the screen, or make a web service call with it, or whatever you need to do with it).
I hope the regex below will work:
<div.*?>(.*?)<\/div>
You will get your desired output:
This is a test This is ANOTHER test