I would like to extract all the text (displayed or not) from a general HTML page. I would like to remove:

- any HTML tags
- any JavaScript
- any CSS styles

Is there a regular expression (one or more) that will achieve that?

regular expression to extract text from HTML

2 Answers

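Remove JavaScript and CSS (scripts and styles usually span several lines, so enable your regex engine's single-line/dotall option so that the dot also matches newlines):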

<(script|style).*?</\1>

Remove tags:

<.*?>
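
The two patterns above can be applied in sequence. The following is only a minimal sketch in Python, not part of the original answer: the function name `strip_html`, the whitespace collapsing, and the choice of flags (`re.DOTALL` so the dot crosses line breaks, `re.IGNORECASE` for upper-case tags) are assumptions made for illustration.

```python
import re

# The two patterns from the answer. DOTALL lets "." match newlines;
# IGNORECASE is an added assumption so <SCRIPT> and <STYLE> are caught too.
SCRIPT_STYLE_RE = re.compile(r"<(script|style).*?</\1>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def strip_html(html: str) -> str:
    """Hypothetical helper: drop script/style blocks, then every remaining tag."""
    # Order matters: remove script/style blocks first, otherwise their
    # contents would survive as plain text once the tags are gone.
    without_code = SCRIPT_STYLE_RE.sub(" ", html)
    without_tags = TAG_RE.sub(" ", without_code)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(without_tags.split())


if __name__ == "__main__":
    page = """<html><head><style>p {color: red}</style>
    <script>
      var a = "<b>";
    </script></head>
    <body><p>Hello <b>world</b></p></body></html>"""
    print(strip_html(page))
```

This works on reasonably tidy markup only. It does not decode entities and it can be confused by a `>` inside an attribute value or a comment.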

answered Sep 18 '22 12:09

nickf

You can't really parse HTML with regular expressions. It's too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs, such as the entity-escaped &lt;text&gt;, display in a browser as ordinary text, but might baffle a naive RE.
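
To make these failure modes concrete, here is a small Python sketch that runs a naive tag-stripping pattern over three inputs. The sample strings and the helper name `naive_strip` are made up for illustration; the attribute case is an extra example of the same kind of problem, not one mentioned above.

```python
import html
import re

# Naive "remove every tag" pattern; DOTALL lets "." match newlines.
TAG_RE = re.compile(r"<.*?>", re.DOTALL)


def naive_strip(markup: str) -> str:
    """Hypothetical helper: delete anything that looks like a tag."""
    return TAG_RE.sub("", markup)


# 1. A CDATA section: the match stops at the first ">", so part of the
#    character data and the closing "]]>" are left behind.
cdata_result = naive_strip("<![CDATA[ a < b > c ]]>")

# 2. Escaped text: the regex leaves the entities untouched, while a browser
#    would display them as the literal text <text>.
entity_result = naive_strip("&lt;text&gt;")

# 3. Decoding entities *before* stripping is no fix: the displayed text
#    now looks like a tag and is deleted entirely.
decoded_result = naive_strip(html.unescape("&lt;text&gt;"))

# 4. A ">" inside a quoted attribute value ends the match too early.
attr_result = naive_strip('<a title="x > y">link</a>')

if __name__ == "__main__":
    for result in (cdata_result, entity_result, decoded_result, attr_result):
        print(repr(result))
```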

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
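
As a rough illustration of the parser-based approach, the sketch below uses `html.parser` from Python's standard library rather than Beautiful Soup, purely so that it runs without installing anything. The class name `TextExtractor` and the function `extract_text` are hypothetical, and the way text chunks are joined with spaces is a simplifying assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text outside <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; in text.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside script/style content.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with spaces and collapse whitespace. Simplification:
        # inline tags inside a word ("Hel<b>lo</b>") will introduce a space.
        return " ".join(" ".join(self._chunks).split())


def extract_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><style>p {color: red}</style></head>"
        "<body><p>Tom &amp; Jerry</p>"
        '<script>var s = "<i>x</i>";</script></body></html>'
    )
    print(extract_text(page))
```

The parser tracks where script and style content begins and ends and decodes entities in ordinary text, which is the bookkeeping the regex approach has to approximate.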

answered Sep 22 '22 12:09

S.Lott

Tags: html, regex, text-extraction, html-content-extraction

Ron Harlev