I am looking for a regex statement that will let me extract the HTML content from just between the body tags from a XHTML document.
The XHTML that I need to parse will be very simple files, I do not have to worry about JavaScript content or <![CDATA[
tags, for example.
Below is the expected structure of the HTML file is that I have to parse. Since I know exactly all of the content of the HTML files that I am going to have to work with, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title> </title> </head> <body contenteditable="true"> <p> Example paragraph content </p> <p> </p> <p> <br /> </p> <h1>Header 1</h1> </body> </html>
Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split()
method to obtain the body content. I thought this regex:
((.|\n)*<body (.)*>)|((</body>(*|\n)*)
...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.
You can use regex to validate with JavaScript or via the HTML pattern attribute. It's easy to construct regular expressions to validate common types of form inputs like dates and usernames.
E.g. (? i-sm) turns on case insensitivity, and turns off both single-line mode and multi-line mode.
For example, the replacement pattern $1 indicates that the matched substring is to be replaced by the first captured group. For more information about numbered capturing groups, see Grouping Constructs.
The RegExp \D Metacharacter in JavaScript is used to search non digit characters i.e all the characters except digits. It is same as [^0-9].
Would this work ?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
Of course, you need to add the necessary \s
in order to take into account < body ...>
(element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed xhtml document):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With