
BeautifulSoup: How to include the encoding on output?

I would like to include the encoding tag in an XML document using BeautifulSoup.BeautifulStoneSoup, but I'm not sure how!

<?xml version="1.0" encoding="UTF-8"?>
<mytag>stuff</mytag>

It outputs the encoding tag when I read a document that already has it, but I'm making a new soup.

Thanks!

Edit: I'll give an example of what I'm currently doing.

from BeautifulSoup import BeautifulStoneSoup, Tag
soup = BeautifulStoneSoup()
mytag = Tag(soup, 'mytag')
soup.append(mytag)

str(soup)
# '<mytag></mytag>'

soup.prettify() # No encoding given
# '<mytag>\n</mytag>'

soup.prettify(encoding='UTF-8')
# '<mytag>\n</mytag>' # Where's the encoding?

Even if I create the soup like BeautifulStoneSoup(fromEncoding='UTF-8'), there is still no <?xml?> tag.

Is there another way to get that tag without creating and passing the tag as a string directly, or is that the only way?

TorelTwiddler asked Nov 13 '22 18:11


1 Answer

Do you mean something like this?

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup('<?xml version="1.0" encoding="UTF-8"?>')
# make some more soup

Or,

soup = BeautifulStoneSoup()
# make some more soup
soup.insert(0, '<?xml version="1.0" encoding="UTF-8"?>')

From the BeautifulSoup documentation:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

  • An encoding you pass in as the fromEncoding argument to the soup constructor.
  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

Beautiful Soup will almost always guess right if it can make a guess at all. But for documents with no declarations and in strange encodings, it will often not be able to guess.

Note item #2, which I read as: Beautiful Soup will automatically use the encoding from the XML declaration if you don't explicitly specify one with the fromEncoding argument. YMMV.
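As a rough illustration of the priority order quoted above, here is a minimal standard-library sketch in Python 3. It is not Beautiful Soup's actual implementation: the names `candidate_encodings` and `decode_markup`, the regular expression, and the BOM table are assumptions made up for this example, and the chardet and EBCDIC steps are left out.

```python
import codecs
import re

# Byte-order marks checked in a simplified "sniffing" step (assumption of this sketch).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Hypothetical pattern for pulling the encoding out of an XML declaration.
_XML_DECL = re.compile(br'^\s*<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def candidate_encodings(data, from_encoding=None):
    """Yield encodings to try, highest priority first."""
    if from_encoding:                      # 1. explicitly requested encoding
        yield from_encoding
    match = _XML_DECL.match(data)
    if match:                              # 2. encoding declared in the document
        yield match.group(1).decode("ascii")
    for bom, name in _BOMS:                # 3. sniff the first few bytes
        if data.startswith(bom):
            yield name
    yield "utf-8"                          # 4. fallbacks
    yield "windows-1252"


def decode_markup(data, from_encoding=None):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidate_encodings(data, from_encoding):
        try:
            return data.decode(encoding), encoding
        except (LookupError, UnicodeDecodeError):
            continue  # unknown or non-matching encoding: try the next one
    raise ValueError("no candidate encoding worked")


# b"caf\xe9" is not valid UTF-8, so the sketch falls through to Windows-1252.
text, used = decode_markup(b"caf\xe9")
```

Because the explicit encoding is tried first and the declared one second, a declaration only matters when no `from_encoding` is given or when the given one fails, which matches the reading of item #2.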

The documentation referenced above also has other potentially useful Unicode-related examples.


Edit: @TorelTwiddler, if there is another way to add an xml declaration using BeautifulSoup without passing the tag as a string directly, I am not aware of it.

That said, consider the following:

from BeautifulSoup import BeautifulStoneSoup, Tag
soup = BeautifulStoneSoup('<?xml version="1.0" encoding=""?>') # <- empty encoding attribute
mytag = Tag(soup, 'mytag')
soup.append(mytag)

print str(soup)
# "<?xml version='1.0' encoding='utf-8'?><mytag></mytag>" 
# Wha!? :)
print soup.prettify(encoding='euc-jp')
# <?xml version='1.0' encoding='euc-jp'?>
# <mytag>
# </mytag>

Perhaps that'll help you get where you want to go.
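If the declaration ends up being passed as a string anyway, one library-independent fallback is to prepend it to the already serialized markup. The sketch below assumes you have the serialized document as a Unicode string; the helper name `with_xml_declaration` is hypothetical and is not part of BeautifulSoup.

```python
def with_xml_declaration(markup, encoding="UTF-8"):
    """Return `markup` as bytes in `encoding`, prefixed with an XML declaration."""
    declaration = u'<?xml version="1.0" encoding="%s"?>\n' % encoding
    # Encode declaration and body together so the declared encoding is the real one.
    return (declaration + markup).encode(encoding)


output = with_xml_declaration(u"<mytag>stuff</mytag>")
# output == b'<?xml version="1.0" encoding="UTF-8"?>\n<mytag>stuff</mytag>'
```

Building the declaration from the same `encoding` argument used for encoding keeps the two from drifting apart, which is the main risk of hard-coding the string.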

Marty answered Dec 04 '22 21:12