Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get document DOCTYPE with BeautifulSoup

I've just started tinkering with scrapy in conjunction with BeautifulSoup and I'm wondering if I'm missing something very obvious but I can't seem to figure out how to get the doctype of a returned html document from the resulting soup object.

Given the following html:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"> 
<head> 
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title> 
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" /> 
<script src="js/h5utils.js"></script> 
</head> 
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>

Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?

like image 711
Steerpike Avatar asked Mar 23 '10 11:03

Steerpike


People also ask

Is BeautifulSoup library is used to parse the document and for extracting HTML documents?

BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, such as tag, navigable string, or comment.

How do I parse HTML data with BeautifulSoup?

BeautifulSoup() function helps us to parse the html file or you say the encoding in html. The loop used here with find_all() finds all the tags containing paragraph tag <p></p> and the text between them are collected by the get_text() method.

How do I extract text from P tags in BeautifulSoup?

Create an HTML document and specify the '<p>' tag into the code. Pass the HTML document into the Beautifulsoup() function. Use the 'P' tag to extract paragraphs from the Beautifulsoup object. Get text from the HTML document with get_text().

What does BeautifulSoup Find_all return?

Beautiful Soup provides "find()" and "find_all()" functions to get the specific data from the HTML file by putting the specific tag in the function. find() function - return the first element of given tag. find_all() function - return the all the element of given tag.


2 Answers

Beautiful Soup 4 has a class for DOCTYPE declarations, so you can use that to extract all the declarations at top level (though you're no doubt expecting one or none!)

def doctype(soup):
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None
like image 103
rptb1 Avatar answered Nov 05 '22 15:11

rptb1


You can go through top-level elements and check each to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:

for child in soup.contents:
    if isinstance(child, BS.Declaration):
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child
like image 31
zvone Avatar answered Nov 05 '22 14:11

zvone