I have a text containing HTML entities such as &lt; and &nbsp;. I need to remove them all and get just the text content:
&nbsp;Hello there&lt;testdata&gt;
So I need to get Hello there and testdata from this section. Is there any way of using negative lookahead to do this?
I tried the following: /((?!&.+;).)+/ig
but this doesn't seem to work very well. So how can I extract just the required text from there?
A better syntax to find HTML entities is the following regular expression:
/&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});/ig
This pattern ignores false entities, such as a stray & that happens to be followed later by an unrelated ;.
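As a rough sketch of how this pattern could be applied, here is a Python translation. The sample string and the names ENTITY_RE, sample and pieces are illustrative assumptions, not part of the original answer. The group is made non-capturing so that re.split returns only the text between entities, and the /i flag becomes re.IGNORECASE (Python's re functions already work globally).

```python
import re

# Python translation of the entity pattern above (group made non-capturing).
ENTITY_RE = re.compile(
    r"&(?:[a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});",
    re.IGNORECASE,
)

sample = "&nbsp;Hello there&lt;testdata&gt;"  # illustrative input

# Split on every entity and drop the empty strings left at the edges.
pieces = [p for p in ENTITY_RE.split(sample) if p]
print(pieces)  # ['Hello there', 'testdata']

# A bare & followed by a space is not an entity, so this text is left alone.
untouched = ENTITY_RE.sub("", "Tom & Jerry; done")
print(untouched)  # Tom & Jerry; done
```

Note that the pattern only checks the shape of an entity, so something like &T; in AT&T; would still be treated as one.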
Here are 2 suggestions:
1) Match all the entities using /(&.+?;)/ig (the quantifier is lazy so that each match stops at the first ;). Then, in whatever programming language you are using, replace those matches with an empty string. For example, in PHP use preg_replace; in C# use Regex.Replace. See this SO question for a similar solution that accounts for more cases: How to remove html special chars?
2) If you really want to work from the plaintext portions, you could try something like this: /(?:^|;)([^&;]+)(?:&|$)/ig. It tries to match the pieces between ; and &, with special cases for a start or end without entities. This is probably not the way to go, since you are likely to run into cases where it breaks.
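Both suggestions can be tried out quickly. The following is a minimal sketch in Python, assuming an illustrative sample string; the variable names are hypothetical. It also compares a greedy .+ with a lazy .+? in the first pattern, since a greedy match runs from the first & to the last ;.

```python
import re

sample = "&nbsp;Hello there&lt;testdata&gt;"  # illustrative input

# Suggestion 1: delete every entity. A greedy .+ runs from the first & to the
# last ; and takes the text in between with it; the lazy .+? stops at the first ;.
greedy = re.sub(r"&.+;", "", sample)
lazy = re.sub(r"&.+?;", "", sample)
print(repr(greedy))  # ''
print(repr(lazy))    # 'Hello theretestdata'

# Suggestion 2: capture the text between a ; (or the start) and a & (or the end).
pieces = re.findall(r"(?:^|;)([^&;]+)(?:&|$)", sample)
print(pieces)        # ['Hello there', 'testdata']

# A stray & that is not part of an entity shows how fragile this is:
# everything after it is lost.
broken = re.findall(r"(?:^|;)([^&;]+)(?:&|$)", "Tom & Jerry")
print(broken)        # ['Tom ']
```

Replacing with an empty string glues neighbouring pieces together, so replacing with a space (or splitting instead of substituting) may be closer to what the question asks for.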
It's language specific, but in Python you can use html.unescape (see the html module documentation). For example:
import html
print(html.unescape("This string contains &amp; and &gt;"))
# prints: This string contains & and >
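One thing to keep in mind is that html.unescape decodes entities into the characters they stand for; it does not remove them. A small sketch, using an illustrative sample string assumed from the question:

```python
import html

sample = "&nbsp;Hello there&lt;testdata&gt;"  # illustrative input

decoded = html.unescape(sample)
# The entities become real characters: a non-breaking space, < and >.
print(repr(decoded))  # '\xa0Hello there<testdata>'
```

If the goal is to drop the entities entirely, a regex replacement is the closer fit.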