Regular Expression to Extract HTML Body Content

I am looking for a regex that will let me extract just the HTML content between the body tags of an XHTML document.

The XHTML files that I need to parse will be very simple; I do not have to worry about JavaScript content or <![CDATA[ tags, for example.

Below is the expected structure of the HTML files that I have to parse. Since I know exactly what the HTML files I will be working with contain, this snippet pretty much covers my entire use case. If I can get a regex to extract the body of this example, I'll be happy.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
    </title>
  </head>
  <body contenteditable="true">
    <p>
      Example paragraph content
    </p>
    <p>
      &nbsp;
    </p>
    <p>
      <br />
      &nbsp;
    </p>
    <h1>Header 1</h1>
  </body>
</html>

Conceptually, I've been trying to build a regex string that matches everything BUT the inner body content. With this, I would use the C# Regex.Split() method to obtain the body content. I thought this regex:

((.|\n)*<body (.)*>)|((</body>(*|\n)*) 

...would do the trick, but it doesn't seem to work at all with my test content in RegexBuddy.

Matthew Ruston asked Dec 10 '08 14:12


1 Answer

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+) 

Of course, you need to add the necessary \s to account for tags written with extra whitespace, such as < body ...>, as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+) 
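
To see which tag spellings the whitespace-tolerant fragments accept, here is a minimal sketch in Python rather than C#. It isolates only the two tag sub-patterns from the regex above; the helper names is_open_body and is_close_body are made up for this illustration.

```python
import re

# Tag fragments taken from the whitespace-tolerant pattern above.
OPEN_BODY = re.compile(r"<\s*body[^>]*>")
CLOSE_BODY = re.compile(r"<\s*/\s*body\s*>")

def is_open_body(tag: str) -> bool:
    """True if the whole string is an opening body tag (hypothetical helper)."""
    return OPEN_BODY.fullmatch(tag) is not None

def is_close_body(tag: str) -> bool:
    """True if the whole string is a closing body tag (hypothetical helper)."""
    return CLOSE_BODY.fullmatch(tag) is not None

# Print how each candidate tag is classified by the two fragments.
for tag in ('<body>', '< body contenteditable="true">', '</body>', '< / body >', '<p>'):
    print(repr(tag), is_open_body(tag), is_close_body(tag))
```

Note that [^>]* also accepts any attributes on the opening tag, which is what lets the pattern handle contenteditable="true" in the sample document.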

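On second thought, I am not sure why I needed a negative look-ahead. This should also work for a well-formed XHTML document, provided . is allowed to match line breaks (RegexOptions.Singleline in .NET), which the patterns above need as well: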

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+) 
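
As a rough check of the split approach, here is a minimal sketch in Python rather than C#, under a few assumptions. re.DOTALL plays the role of .NET's RegexOptions.Singleline. The groups are made non-capturing, because both Python's re.split and .NET's Regex.Split include captured text in their results. The function name extract_body and the shortened sample document are invented for this illustration.

```python
import re

# Non-capturing variant of the last pattern above. re.DOTALL lets "." match
# line breaks, so ".*" and ".+" can span the whole prologue and epilogue.
OUTSIDE_BODY = re.compile(
    r"(?:.*<\s*body[^>]*>)|(?:<\s*/\s*body\s*>.+)",
    re.DOTALL,
)

def extract_body(xhtml: str) -> str:
    """Split away everything outside <body>...</body> (hypothetical helper)."""
    # Splitting leaves empty strings where the prologue and epilogue were;
    # drop them and keep the remaining piece, which is the inner body.
    parts = [part for part in OUTSIDE_BODY.split(xhtml) if part]
    return parts[0].strip() if parts else ""

# Shortened version of the sample document from the question.
sample = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title></title></head>
  <body contenteditable="true">
    <p>Example paragraph content</p>
    <h1>Header 1</h1>
  </body>
</html>
"""

print(extract_body(sample))
```

The trailing .+ requires at least one character after the closing body tag, which holds for a well-formed document because </html> follows it.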

VonC answered Oct 05 '22 04:10