Get document DOCTYPE with BeautifulSoup

Tags: python, parsing, beautifulsoup, scrapy

I've just started tinkering with Scrapy in conjunction with BeautifulSoup. I may be missing something very obvious, but I can't figure out how to get the doctype of a returned HTML document from the resulting soup object.

Given the following HTML:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en"> 
<head> 
<meta charset=utf-8 />
<meta name="viewport" content="width=620" />
<title>HTML5 Demos and Examples</title> 
<link rel="stylesheet" href="/css/html5demos.css" type="text/css" /> 
<script src="js/h5utils.js"></script> 
</head> 
<body>
<p id="firstpara" align="center">This is paragraph <b>one</b>
<p id="secondpara" align="blah">This is paragraph <b>two</b>.
</html>

Can anyone tell me if there's a way of extracting the declared doctype from it using BeautifulSoup?


asked Mar 23 '10 11:03

Steerpike

2 Answers

Beautiful Soup 4 has a class for DOCTYPE declarations, bs4.Doctype, so you can use it to extract all the declarations at the top level (though you're no doubt expecting one or none!):

import bs4

def doctype(soup):
    # Collect the top-level nodes that are DOCTYPE declarations; return the first, or None.
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None
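
As a rough illustration of how that helper might be called, here is a minimal sketch. It assumes Beautiful Soup 4 is installed and uses the built-in "html.parser" backend; the names SAMPLE_HTML and found are made up for this example, and the markup is a trimmed-down version of the page in the question.

```python
import bs4

# Trimmed-down version of the document from the question (illustrative only).
SAMPLE_HTML = """<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head><title>HTML5 Demos and Examples</title></head>
<body><p>This is paragraph one</p></body>
</html>"""


def doctype(soup):
    # Same helper as in the answer: first top-level Doctype node, or None.
    items = [item for item in soup.contents if isinstance(item, bs4.Doctype)]
    return items[0] if items else None


# "html.parser" is an assumption; other parsers may build the tree differently.
soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")
found = doctype(soup)

if found is None:
    print("No doctype declared")
else:
    # Doctype is a string subclass, so it can be printed or searched directly.
    print(found)
```

Because the helper returns None when nothing matches, callers can tell "no doctype" apart from an empty string without catching an exception.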


answered Nov 05 '22 15:11

rptb1

You can go through top-level elements and check each to see whether it is a declaration. Then you can inspect it to find out what kind of declaration it is:


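# Assumes BS is the Beautiful Soup 3 module, e.g. `import BeautifulSoup as BS`.
# In Beautiful Soup 4, DOCTYPE nodes are bs4.Doctype instances instead.
declaration = None
for child in soup.contents:
    if isinstance(child, BS.Declaration):
        # The first word of the declaration text names its kind, e.g. DOCTYPE.
        declaration_type = child.string.split()[0]
        if declaration_type.upper() == 'DOCTYPE':
            declaration = child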

answered Nov 05 '22 14:11

zvone
