
Remove HTML entities and extract text content using regex

Tags:

regex

I have a text containing HTML entities such as &lt; and &nbsp;. I need to remove all of these and get just the text content:

&nbsp;Hello there&lt;testdata&gt;

So, I need to get Hello there and testdata from this section. Is there any way of using negative lookahead to do this?

I tried the following: /((?!&.+;).)+/ig but this doesn't seem to work very well. So, how can I extract just the required text from there?

Mkl Rjv asked Sep 30 '14 18:09


3 Answers

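A better pattern for matching HTML entities is the following regular expression:

/&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});/ig

It matches named entities (&name;), decimal entities (&#123;) and hexadecimal entities (&#x1F;), and it ignores false entities such as a bare & that is merely followed by a ; somewhere later in the text.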
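
As a rough illustration of how that pattern could be applied to the string from the question, here is a minimal Python sketch. The function name text_pieces is made up for this example, re.IGNORECASE stands in for the /i flag, and the group has been made non-capturing (an adjustment for re.split, not part of the original pattern).

```python
import re

# Entity pattern from the answer above. The group is non-capturing here
# (an adjustment for this sketch) so that re.split() returns only the text
# between entities, not the entity names themselves.
ENTITY_RE = re.compile(
    r"&(?:[a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});",
    re.IGNORECASE,
)


def text_pieces(text):
    """Split text on well-formed entities and return the non-empty pieces."""
    return [piece.strip() for piece in ENTITY_RE.split(text) if piece.strip()]


print(text_pieces("&nbsp;Hello there&lt;testdata&gt;"))
# prints: ['Hello there', 'testdata']

# A bare "&" followed later by ";" is not treated as an entity.
print(text_pieces("a & b; c &amp; d"))
# prints: ['a & b; c', 'd']
```

Splitting rather than deleting keeps "Hello there" and "testdata" as separate pieces, which seems to be what the question asks for.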

Mahoor13 answered Nov 16 '22 00:11


Here are 2 suggestions:

1) Match all the entities using /(&.+?;)/ig. The quantifier is lazy (.+? rather than .+) because a greedy .+ would run from the first & to the last ; in the string and swallow the text in between. Then, in whatever programming language you are using, replace those matches with an empty string. For example, in PHP use preg_replace; in C# use Regex.Replace. See this Stack Overflow question for a similar solution that accounts for more cases: How to remove html special chars?

2) If you really want to match the plaintext portions instead, you could try something like this: /(?:^|;)([^&;]+)(?:&|$)/ig. What it actually does is match the pieces between ; and &, with special cases for text at the start or end of the string that has no adjacent entity. This is probably not the way to go; you are likely to run into cases that break it.
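
To make the two suggestions concrete, here is a small Python sketch run against the string from the question. The variable names are illustrative only, Python's re module is used in place of the /.../ig notation, and suggestion 1 is written with a lazy quantifier so that each match stops at the nearest semicolon.

```python
import re

sample = "&nbsp;Hello there&lt;testdata&gt;"

# Suggestion 1: delete every "&...;" run. The lazy ".+?" makes each match
# stop at the nearest ";" instead of spanning the whole string.
removed = re.sub(r"&.+?;", "", sample)
print(removed)
# prints: Hello theretestdata

# Suggestion 2: capture the text that sits between a ";" (or the start of
# the string) and the next "&" (or the end of the string).
pieces = re.findall(r"(?:^|;)([^&;]+)(?:&|$)", sample)
print(pieces)
# prints: ['Hello there', 'testdata']

# One of the cases that breaks suggestion 2: a bare "&" that is not part of
# an entity ends the first piece, and the text after it is never matched.
fragile = re.findall(r"(?:^|;)([^&;]+)(?:&|$)", "Tom & Jerry")
print(fragile)
# prints: ['Tom ']
```

Note that replacing with an empty string joins the neighbouring words together; replacing with a space instead may be preferable if the pieces need to stay separate.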

dtyler answered Nov 15 '22 23:11


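It's language specific, but in Python you can use html.unescape (see the Python documentation), which converts each entity to the character it represents. For example:

import html
print(html.unescape("This string contains &amp; and &gt;"))
# prints: This string contains & and >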

gneusch answered Nov 16 '22 00:11