How to concatenate two HTML file bodies with BeautifulSoup?

I need to concatenate the bodies of two HTML files into one HTML file, with a bit of arbitrary HTML as a separator in between. I have code that used to work for this, but it stopped working when I upgraded from Xubuntu 11.10 (or was it 11.04?) to 12.10, probably because of a BeautifulSoup update (I'm currently using 3.2.1; I don't know which version I had previously) or a vim update (I use vim to auto-generate the HTML files from plaintext ones). This is the stripped-down version of the code:

from BeautifulSoup import BeautifulSoup
soup_original_1 = BeautifulSoup(''.join(open('test1.html')))
soup_original_2 = BeautifulSoup(''.join(open('test2.html')))
contents_1 = soup_original_1.body.renderContents()
contents_2 = soup_original_2.body.renderContents()
contents_both = contents_1 + "\n<b>SEPARATOR\n</b>" + contents_2
soup_new = BeautifulSoup(''.join(open('test1.html')))
while len(soup_new.body.contents):
    soup_new.body.contents[0].extract()
soup_new.body.insert(0, contents_both)

The bodies of the two input files used for the test case are very simple: contents_1 is '\n<pre>\nFile 1\n</pre>\n' and contents_2 is '\n<pre>\nFile 2\n</pre>\n'.

I would like soup_new.body.renderContents() to be a concatenation of those two with the separator text in between, but instead every < is turned into &lt;, every > into &gt;, and so on. The desired result, which is what I got before the OS update, is '\n<pre>\nFile 1\n</pre>\n\n<b>SEPARATOR\n</b>\n<pre>\nFile 2\n</pre>\n'. The current result is '\n&lt;pre&gt;\nFile 1\n&lt;/pre&gt;\n\n&lt;b&gt;SEPARATOR\n&lt;/b&gt;\n&lt;pre&gt;\nFile 2\n&lt;/pre&gt;\n', which is pretty useless.

How do I make BeautifulSoup stop turning < into &lt; etc. when inserting HTML as a string into a soup object's body? Or should I be doing this in an entirely different way? (This is my only experience with BeautifulSoup, and with HTML parsing in general, so I'm guessing this may well be the case.)

The HTML files are automatically generated from plaintext files with vim (the real cases I use are obviously more complicated and involve custom syntax highlighting, which is why I'm doing it this way at all). The full test1.html file looks like this; test2.html is identical except for contents and title.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<title>~/programs/lab_notebook_and_printing/concatenate-html_problem_2013/test1.txt.html</title>
<meta name="Generator" content="Vim/7.3" />
<meta name="plugin-version" content="vim7.3_v10" />
<meta name="syntax" content="none" />
<meta name="settings" content="ignore_folding,use_css,pre_wrap,expand_tabs,ignore_conceal" />
<style type="text/css">
pre { white-space: pre-wrap; font-family: monospace; color: #000000; background-color: #ffffff; white-space: pre-wrap; word-wrap: break-word }
body { font-family: monospace; color: #000000; background-color: #ffffff; font-size: 0.875em }
</style>
</head>
<body>
<pre>
File 1
</pre>
</body>
</html>
asked by weronika on Nov 21 '13 at 21:11

1 Answer

Reading the HTML back out as text just to insert it into another HTML tree, and then fighting the escaping and unescaping in both directions, makes a lot of extra work that is very hard to get right.
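
To see where the &lt; comes from, here is a minimal sketch. It assumes BeautifulSoup 4 (the bs4 package) and the built-in "html.parser" rather than the BeautifulSoup 3 import in the question. A plain string added to the tree becomes a text node, so its markup is escaped on output, while a fragment that has been parsed first goes in as a real element:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body></body></html>", "html.parser")

# A plain str becomes a text node, so its "<" and ">" are escaped on output.
soup.body.append("<b>SEPARATOR</b>")
escaped = soup.body.decode_contents()

# Start over, parse the fragment first, and append the resulting <b> element.
soup.body.clear()
fragment = BeautifulSoup("<b>SEPARATOR</b>", "html.parser")
soup.body.append(fragment.b)
parsed = soup.body.decode_contents()

print(escaped)
print(parsed)
```

The first line printed contains the escaped entities; the second contains the actual <b> tag.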

The easy thing to do is just not do that. You want to insert everything in the body of test2 after everything in the body of test1, right? So just do that:

# Iterate over a copy: append() moves each element out of soup_original_2.
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)

To append a separator first, just do the same thing with the separator (new_tag requires BeautifulSoup 4, i.e. from bs4 import BeautifulSoup):

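b = soup_original_1.new_tag('b')
b.append('SEPARATOR')
soup_original_1.body.append(b)
# Iterate over a copy: append() moves each element out of soup_original_2.
for element in list(soup_original_2.body.contents):
    soup_original_1.body.append(element)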

That's it.
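
Put together, the whole thing might look like the sketch below. It assumes BeautifulSoup 4 with the built-in "html.parser"; the inline strings stand in for test1.html and test2.html, and merge_bodies is a hypothetical helper name, not part of the library.

```python
from bs4 import BeautifulSoup

# Stand-ins for the contents of test1.html and test2.html.
HTML_1 = "<html><body>\n<pre>\nFile 1\n</pre>\n</body></html>"
HTML_2 = "<html><body>\n<pre>\nFile 2\n</pre>\n</body></html>"


def merge_bodies(html_first, html_second, separator_text="SEPARATOR"):
    """Append a <b> separator and the second body's nodes to the first body."""
    soup_first = BeautifulSoup(html_first, "html.parser")
    soup_second = BeautifulSoup(html_second, "html.parser")

    # Build the separator as a real tag instead of a string of markup.
    separator = soup_first.new_tag("b")
    separator.append(separator_text)
    soup_first.body.append(separator)

    # Copy the child list first: append() moves each node out of soup_second,
    # which would otherwise change the list while it is being iterated.
    for element in list(soup_second.body.contents):
        soup_first.body.append(element)
    return soup_first


merged = merge_bodies(HTML_1, HTML_2)
print(merged.body.decode_contents())
```

With real files, pass the text read from each file in place of HTML_1 and HTML_2, and write str(merged) back out; the head and styles of the first file are kept as they are.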

See the "Modifying the tree" section of the BeautifulSoup documentation for a tutorial that covers all of this.

answered by abarnert on Oct 13 '22 at 20:10