
How to detect with Python if a string contains HTML code?

How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need to know the HTML version, only whether the string is plain text or contains HTML. The text is typically multiline and may include empty lines.

Update:

example inputs:

html:

<head><title>I'm title</title></head> Hello, <b>world</b> 

non-html:

<ht fldf d>< <html><head> head <body></body> html 
static asked Jul 20 '14 23:07



1 Answer

You can use an HTML parser such as BeautifulSoup. Note that it tries its best to parse whatever it is given, even broken HTML, and how lenient it is depends on the underlying parser:

>>> from bs4 import BeautifulSoup
>>> html = """<html>
... <head><title>I'm title</title></head>
... </html>"""
>>> non_html = "This is not an html"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
>>> bool(BeautifulSoup(non_html, "html.parser").find())
False

This tries to find any HTML element inside the string. If one is found, the result is True.

Another example with an HTML fragment:

>>> html = "Hello, <b>world</b>"
>>> bool(BeautifulSoup(html, "html.parser").find())
True
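If installing a third-party package is not an option, the same "is there at least one element?" check can probably be approximated with the standard library's html.parser module, which is the parser that the "html.parser" argument selects in BeautifulSoup. The following is only a minimal sketch, not part of the original answer: the names `_TagCounter` and `contains_html` are made up for illustration, and the result may differ from BeautifulSoup's on unusual or malformed input.

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Hypothetical helper: counts the start tags the parser reports."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        # Called for every opening tag; self-closing tags such as <br/>
        # also end up here through the default handle_startendtag().
        self.tag_count += 1


def contains_html(text):
    """Return True if the parser reports at least one element in text."""
    parser = _TagCounter()
    parser.feed(text)
    parser.close()
    return parser.tag_count > 0


print(contains_html("Hello, <b>world</b>"))
print(contains_html("This is not an html"))
```

Like the BeautifulSoup check, this treats any recognizable tag as HTML, so a string with one stray tag in otherwise plain text counts as HTML.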

Alternatively, you can use lxml.html:

>>> import lxml.html
>>> html = 'Hello, <b>world</b>'
>>> non_html = "<ht fldf d><"
>>> lxml.html.fromstring(html).find('.//*') is not None
True
>>> lxml.html.fromstring(non_html).find('.//*') is not None
False
alecxe answered Sep 24 '22 06:09