How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.
example inputs:
html:
<head><title>I'm title</title></head> Hello, <b>world</b>
non-html:
<ht fldf d>< <html><head> head <body></body> html
test.bind(/(<([^>]+)>)/i); This basically returns true for any string containing a < followed by anything followed by a >.
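To see what that pattern does with the two example inputs from the question, here is a minimal Python sketch. It is a translation of the quoted JavaScript regex, not the answer's own code, and the helper name `looks_like_html` is a hypothetical one chosen for illustration.

```python
import re

# Same pattern as the regex quoted above: "<", one or more non-">" characters, ">".
TAG_LIKE = re.compile(r"(<([^>]+)>)", re.IGNORECASE)


def looks_like_html(text: str) -> bool:
    """Hypothetical helper: True if text contains anything shaped like <...>."""
    return TAG_LIKE.search(text) is not None


html = "<head><title>I'm title</title></head> Hello, <b>world</b>"
non_html = "<ht fldf d>< <html><head> head <body></body> html"
plain = "Just text\n\nwith an empty line"
math = "1 < 2 and 3 > 2"

print(looks_like_html(html))      # True
print(looks_like_html(non_html))  # True (the question wanted this rejected)
print(looks_like_html(plain))     # False
print(looks_like_html(math))      # True (a comparison is mistaken for a tag)
```

The pattern is cheap, but it cannot tell a real tag from any other text that happens to sit between a < and a >, which is why the question's non-HTML sample still passes.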
Unlike most HTML parsers which generate tree structures, HTMLString generates a string of characters each with its own set of tags. This flat structure makes it easy to manipulate ranges (for example - text selected by a user) as each character is independent and doesn't rely on a hierarchical tag structure.
You can use an HTML parser, like BeautifulSoup. Note that it tries its best to parse HTML, even broken HTML, and how lenient it is depends on the underlying parser:
>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False
This tries to find any HTML element inside the string. If one is found, the result is True.
Another example with an HTML fragment:
>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
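If BeautifulSoup is not available, the same "is there any element at all" idea can be approximated with the standard library's `html.parser.HTMLParser`. This is a swapped-in technique rather than the method from the answer, and it is only a sketch: the names `_TagSpotter` and `contains_element` are hypothetical, and the results are not guaranteed to match BeautifulSoup on every input.

```python
from html.parser import HTMLParser


class _TagSpotter(HTMLParser):
    """Hypothetical helper: remembers whether any start tag was reported."""

    def __init__(self) -> None:
        super().__init__()
        self.saw_tag = False

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag the parser recognises,
        # and also for self-closing tags such as <br/>.
        self.saw_tag = True


def contains_element(text: str) -> bool:
    """Return True if the parser reports at least one element in text."""
    spotter = _TagSpotter()
    spotter.feed(text)
    spotter.close()  # flush any buffered input
    return spotter.saw_tag


print(contains_element("Hello, <b>world</b>"))         # True
print(contains_element("This is not an html"))         # False
print(contains_element("line one\n\nline two"))        # False
```

Like other lenient parsers, this one does not check tag names against the HTML vocabulary, so an unknown tag should still count as an element. A caller who wants to reject such input would need to compare the `tag` argument against a list of known names.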
Alternatively, you can use lxml.html:
>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False