How to concatenate two html file bodies with BeautifulSoup?

Q: Which library is used to parse HTML and XML?

BeautifulSoup is a Python library for parsing HTML and XML documents. It is often used for web scraping. BeautifulSoup transforms a complex HTML document into a complex tree of Python objects, such as tag, navigable string, or comment.

Tags:

python

html

beautifulsoup

I need to concatenate the bodies of two html files into one html file, with a bit of arbitrary html as a separator in between. I have code that used to work for this, but stopped working when I upgraded from Xubuntu 11.10 (or was it 11.04?) to 12.10, probably due to a BeautifulSoup update (I'm currently using 3.2.1; I don't know what version I had previously) or to a vim update (I use vim to auto-generate the html files from plaintext ones). This is the stripped-down version of the code:

from BeautifulSoup import BeautifulSoup
soup_original_1 = BeautifulSoup(''.join(open('test1.html')))
soup_original_2 = BeautifulSoup(''.join(open('test2.html')))
contents_1 = soup_original_1.body.renderContents()
contents_2 = soup_original_2.body.renderContents()
contents_both = contents_1 + "\n<b>SEPARATOR\n</b>" + contents_2
soup_new = BeautifulSoup(''.join(open('test1.html')))
while len(soup_new.body.contents):
    soup_new.body.contents[0].extract()
soup_new.body.insert(0, contents_both)

The bodies of the two input files used for the test case are very simple: contents_1 is '\n<pre>\nFile 1\n</pre>\n' and contents_2 is '\n<pre>\nFile 2\n</pre>\n'.

I would like soup_new.body.renderContents() to be a concatenation of those two with the separator text in between, but instead all the <'s change into &lt; etc. The desired result is '\n<pre>\nFile 1\n</pre>\n\n<b>SEPARATOR\n</b>\n<pre>\nFile 2\n</pre>\n', which is what I used to get prior to the OS update; the current result is '\n&lt;pre&gt;\nFile 1\n&lt;/pre&gt;\n\n&lt;b&gt;SEPARATOR\n&lt;/b&gt;\n&lt;pre&gt;\nFile 2\n&lt;/pre&gt;\n', which is pretty useless.

How do I make BeautifulSoup stop turning < into &lt; etc. when inserting html as a string into a soup object's body? Or should I just be doing this in an entirely different way? (This is my only experience with BeautifulSoup and most other html parsing, so I'm guessing this may well be the case.)

The html files are automatically generated from plaintext files with vim (the real cases I use are obviously more complicated, and involve custom syntax highlighting, which is why I'm doing it this way at all). The full test1.html file looks like this, and test2.html is identical except for contents and title.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>~/programs/lab_notebook_and_printing/concatenate-html_problem_2013/test1.txt.html</title>
<meta name="Generator" content="Vim/7.3" />
<meta name="plugin-version" content="vim7.3_v10" />
<meta name="syntax" content="none" />
<meta name="settings" content="ignore_folding,use_css,pre_wrap,expand_tabs,ignore_conceal" />
<style type="text/css">
pre { white-space: pre-wrap; font-family: monospace; color: #000000; background-color: #ffffff; white-space: pre-wrap; word-wrap: break-word }
body { font-family: monospace; color: #000000; background-color: #ffffff; font-size: 0.875em }
</style>
</head>
<body>
<pre>
File 1
</pre>
</body>
</html>


asked Nov 21 '13 21:11

weronika

1 Answer

Reading the HTML as text just to insert it back into HTML, and fighting the encoding and decoding in both directions, makes a whole lot of extra work that is very difficult to get right.

The easy thing to do is just not do that. You want to insert everything in the body of test2 after everything in the body of test1, right? So just do that:

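# Copy the child list first: append() moves each node out of soup_original_2,
# and iterating the live list while it shrinks would skip elements.
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)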
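
As a minimal, self-contained sketch of that loop, assuming BeautifulSoup 4 (the bs4 package) and its built-in html.parser backend; the inline documents and the names soup_1 and soup_2 are stand-ins for the parsed test1.html and test2.html. Note that append() moves a node rather than copying it, so the sketch iterates over a list() copy of the children:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the two parsed input files.
soup_1 = BeautifulSoup("<html><body><pre>File 1</pre></body></html>", "html.parser")
soup_2 = BeautifulSoup("<html><body><pre>File 2</pre><p>more</p></body></html>", "html.parser")

# append() detaches each node from soup_2 before attaching it to soup_1,
# so take a snapshot of the children instead of iterating the live list.
for element in list(soup_2.body.contents):
    soup_1.body.append(element)

print(soup_1.body)                # body of the first document, now holding both
print(len(soup_2.body.contents))  # the second body has been emptied
```

After the loop the second soup's body is empty, because its children now belong to the first tree.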

To append a separator first, just do the same thing with the separator:

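# new_tag() is part of bs4 (BeautifulSoup 4): from bs4 import BeautifulSoup
b = soup_original_1.new_tag('b')
b.append('SEPARATOR')
soup_original_1.body.append(b)
# list() copy: append() moves nodes, which would otherwise disturb the iteration
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)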

That's it.
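
Putting the pieces together, here is one possible end-to-end sketch. It assumes BeautifulSoup 4 with the html.parser backend; the function name concat_bodies, its parameters, and the inline sample pages are hypothetical and only illustrate the approach described above:

```python
from bs4 import BeautifulSoup


def concat_bodies(html_1, html_2, separator_text="SEPARATOR"):
    """Return html_1 with a <b> separator and html_2's body content appended."""
    soup_1 = BeautifulSoup(html_1, "html.parser")
    soup_2 = BeautifulSoup(html_2, "html.parser")

    # Build the separator as a real tag so nothing has to be escaped.
    separator = soup_1.new_tag("b")
    separator.string = separator_text
    soup_1.body.append(separator)

    # Move every child of the second body into the first one.
    for element in list(soup_2.body.contents):
        soup_1.body.append(element)

    return str(soup_1)


# Hypothetical sample pages standing in for test1.html and test2.html.
page_1 = "<html><head><title>one</title></head><body><pre>File 1</pre></body></html>"
page_2 = "<html><head><title>two</title></head><body><pre>File 2</pre></body></html>"
merged = concat_bodies(page_1, page_2)
print(merged)
```

With the files from the question, the two arguments would be the text read from test1.html and test2.html; the head of the first file (title, style block) is kept and the head of the second is discarded.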

See the "Modifying the tree" section of the documentation for a tutorial that covers all of this.
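
On the narrower question of why the < characters were escaped: a plain string added to the tree is stored as a text node, so its markup characters are written out as entities, whereas a parsed fragment is stored as real tags. The sketch below illustrates this under BeautifulSoup 4 with html.parser (an assumption; the question uses 3.2.1, whose details may differ), and the variable names are hypothetical:

```python
from bs4 import BeautifulSoup

# Case 1: append a plain str. It becomes a text node and is escaped on output.
soup_text = BeautifulSoup("<body></body>", "html.parser")
soup_text.body.append("<b>SEPARATOR</b>")
as_text = str(soup_text.body)

# Case 2: parse the fragment first, then append the resulting Tag object.
soup_markup = BeautifulSoup("<body></body>", "html.parser")
fragment = BeautifulSoup("<b>SEPARATOR</b>", "html.parser")
soup_markup.body.append(fragment.b)
as_markup = str(soup_markup.body)

print(as_text)    # markup characters appear as &lt; and &gt;
print(as_markup)  # the <b> tag is preserved
```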

answered Oct 13 '22 20:10

abarnert
