
Regular expression to extract text from HTML

I would like to extract all the text (displayed or not) from an arbitrary HTML page.

I would like to remove:

  • any HTML tags
  • any JavaScript
  • any CSS styles

Is there a regular expression (one or more) that will achieve that?

asked Oct 08 '08 01:10 by Ron Harlev



2 Answers

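Remove JavaScript and CSS (enable the dot-matches-newline and case-insensitive flags, since script and style blocks usually span several lines and tag names may be uppercase):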

<(script|style).*?</\1>

Remove tags:

<.*?>
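
The two patterns above could be applied in Python roughly as follows. This is a minimal sketch, not a robust solution: the helper name `strip_html` and the sample page are made up for illustration, and the flags are an assumption about how the patterns were meant to be run.

```python
import re

# DOTALL lets "." cross newlines; IGNORECASE covers <SCRIPT> and <Style>.
SCRIPT_STYLE_RE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_html(html: str) -> str:
    """Hypothetical helper: drop script/style blocks first, then any remaining tags."""
    without_code = SCRIPT_STYLE_RE.sub(" ", html)
    without_tags = TAG_RE.sub(" ", without_code)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(without_tags.split())


page = """<html><head><style>
p { color: red; }
</style><script>var x = 1;</script></head>
<body><p>Hello <b>world</b></p></body></html>"""

print(strip_html(page))  # → Hello world
```

The order matters: if the tag pattern ran first, it would remove the `<script>` and `<style>` tags themselves but leave their contents behind as text.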
answered Sep 18 '22 12:09 by nickf


You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs, like &lt;text>, render as proper text in a browser but might baffle a naive RE.
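
To make those failure modes concrete, here is a small Python sketch that runs a naive tag-stripping pattern over a few inputs. The helper name `naive_strip` and the sample strings are made up for illustration.

```python
import re

TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def naive_strip(html: str) -> str:
    """Hypothetical helper: delete everything that looks like a tag."""
    return TAG_RE.sub("", html)


# A ">" inside a quoted attribute ends the match too early.
print(repr(naive_strip('<a title="x > y">link</a>')))  # → ' y">link'

# A CDATA section containing ">" is cut in the middle.
print(repr(naive_strip("<![CDATA[a > b]]>")))  # → ' b]]>'

# Entities are left encoded; a browser would display this as <text>.
print(repr(naive_strip("&lt;text>")))  # → '&lt;text>'
```

Each case can be patched with a more elaborate pattern, but the patches pile up quickly.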

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
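
As a rough illustration of the parser approach, the sketch below uses `html.parser` from Python's standard library rather than Beautiful Soup, so that it runs without extra installs. The class `TextExtractor` and the function `extract_text` are hypothetical names, and this is one possible way to do it, not the only one.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at end of input
    # Join the fragments and collapse whitespace.
    return " ".join(" ".join(parser.parts).split())


print(extract_text('<p title="x > y">a &lt;text> b</p><script>var s = "<b>";</script>'))
# → a <text> b
```

Note that fragments are joined with a space, so text split by inline tags in the middle of a word would gain a space; a real implementation would need to decide how to handle that.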


Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with REs. All it requires is patience and hard work. But it's often simpler to use someone else's parser.

answered Sep 22 '22 12:09 by S.Lott