Regular Expression to Extract HTML Body Content


I am looking for a regex that will let me extract the HTML content between the body tags of an XHTML document.

The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.

Below is the expected structure of the HTML file that I have to parse. Since I know exactly what content the HTML files I have to work with will contain, this HTML snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>
        </title>
      </head>
      <body contenteditable="true">
        <p>
          Example paragraph content
        </p>
        <p>
          &nbsp;
        </p>
        <p>
          <br />
          &nbsp;
        </p>
        <h1>Header 1</h1>
      </body>
    </html>

Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:

((.|\n)*<body (.)*>)|((</body>(*|\n)*)

...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.
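
One possible reason it fails outright: the pattern as written has six opening parentheses but only five closing ones, and the `(*` near the end puts a quantifier directly after an opening parenthesis with nothing to repeat. The check below is a minimal sketch in Python rather than C#; `compile_error` is a hypothetical helper, and other engines are likely, though not guaranteed, to reject the pattern for similar reasons.

```python
import re
from typing import Optional

# The pattern from the question, copied verbatim.
ATTEMPT = r"((.|\n)*<body (.)*>)|((</body>(*|\n)*)"


def compile_error(pattern: str) -> Optional[str]:
    """Return the engine's complaint, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Python refuses to compile the pattern, so no matching ever happens.
    print(compile_error(ATTEMPT))
    # Counting parentheses shows the imbalance directly.
    print(ATTEMPT.count("("), ATTEMPT.count(")"))
```

If the engine reports a syntax error like this, the test content never gets matched at all, which would fit the "doesn't seem to work at all" symptom.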


asked Dec 10 '08 14:12

Matthew Ruston

1 Answer

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative look-ahead. This should also work (for a well-formed XHTML document):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
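
Two details matter when this last pattern is used for splitting. First, `.` does not match line breaks by default, so the pattern only covers the whole prefix and suffix when the engine's "dot matches newline" option is on (`RegexOptions.Singleline` in .NET, `re.DOTALL` in Python). Second, .NET's `Regex.Split()` includes the text of capturing groups in the array it returns, and Python's `re.split()` behaves similarly, so non-capturing groups keep the result clean. The following is a minimal sketch in Python rather than C#; `XHTML`, `OUTSIDE_BODY`, and `extract_body` are hypothetical names, and the sample document is a shortened version of the one in the question.

```python
import re

# Hypothetical sample, shortened from the XHTML file in the question.
XHTML = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
    </title>
  </head>
  <body contenteditable="true">
    <p>
      Example paragraph content
    </p>
    <h1>Header 1</h1>
  </body>
</html>
"""

# The last pattern above, with the two groups made non-capturing so that
# split() does not hand back the matched prefix and suffix as extra items.
OUTSIDE_BODY = re.compile(
    r"(?:.*<\s*body[^>]*>)|(?:<\s*/\s*body\s*\>.+)",
    re.DOTALL,  # let "." cross line breaks
)


def extract_body(document: str) -> str:
    """Split on everything outside the body and keep what remains."""
    parts = OUTSIDE_BODY.split(document)
    # The pieces before the first match and after the last one are empty
    # strings here, so joining leaves only the inner body content.
    return "".join(parts).strip()


if __name__ == "__main__":
    print(extract_body(XHTML))
```

Because the leading `.*` is greedy, the first alternative runs up to the last `<body ...>` tag in the input, which is fine for the simple, known files described in the question but is an assumption worth keeping in mind for anything less predictable.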


answered Oct 05 '22 04:10

VonC
