
BeautifulSoup parser appends semicolons to naked ampersands, mangling URLs?

I am trying to parse a site in Python that contains links to other sites as plain text rather than inside "a" tags. Using BeautifulSoup I get the wrong result. Consider this code:

import BeautifulSoup

html = """<html>
            <head>
              <title>Test html</title>
            </head>
            <body>
              <div>
                example.com/a.php?b=2&c=15
              </div>
            </body>
          </html>"""

parsed = BeautifulSoup.BeautifulSoup(html)
print parsed

When I run the above code I get the following output:

<html>
  <head>
    <title>Test html</title>
  </head>
  <body>
    <div>
      example.com/a.php?b=2&c;=15
    </div>
  </body>
</html>

Notice the link in the "div" and the part b=2&c;=15. It differs from the original HTML. Why is BeautifulSoup altering the link this way? Is it trying to automagically create HTML entities? How can I prevent this?

c0ldcrow asked Aug 25 '11 09:08


1 Answer

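Apparently BeautifulSoup has an underdocumented issue parsing ampersands inside URLs; I found it by searching their discussion forum for 'semicolon'. According to that discussion from 2009, a naked & is strictly not valid HTML and must be written as &amp;. Browsers accept the naked form, though, so this seems overly pedantic. In effect, the parser appears to treat &c as the start of an entity reference and supplies the "missing" semicolon.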

I agree this parsing behavior is bogus, and you should contact their list to ask them to at least document it better as a known issue and fix it in a future release.
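To illustrate the strictly valid form mentioned above, here is a minimal sketch using only the standard library's html module. It assumes Python 3 (the question's code is Python 2), and the variable names are purely illustrative; it does not involve BeautifulSoup at all.

```python
import html

url = "example.com/a.php?b=2&c=15"

# Escape the text for inclusion in HTML; quote=False leaves " and ' alone,
# so only &, < and > are rewritten.
escaped = html.escape(url, quote=False)

# Decoding the entity again gives back the original URL.
restored = html.unescape(escaped)

print(escaped)
print(restored)
```

The escaped string is what a strict parser expects to find in the document, and decoding it recovers the URL as the author wrote it.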

Workaround: your workaround will most likely be re.sub(...) to find each naked & and expand it to &amp;, ideally only inside URLs. You may also need a reverse function to collapse &amp; back to & in the output. You'll need a fancier regex to capture only ampersands inside URLs; this simple version escapes every naked & in the document:

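import re
import BeautifulSoup  # BeautifulSoup 3.x, Python 2

# Minimal string to tickle this
html = "<html>example.com/a.php?b=2&c=15</html>"
# An already-escaped &amp; is skipped by the negative lookahead, e.g.:
#html = "<html>example.com/a.php?b=2&c=15&amp;d=29&e=42</html>"

# Escape every & that is not already followed by "amp;".
# Caveat: other entities such as &lt; would become &amp;lt;
html = re.sub(r'&(?!amp;)', r'&amp;', html)

parsed = BeautifulSoup.BeautifulSoup(html)
>>> parsed.text.encode('utf-8')
'example.com/a.php?b=2&amp;c=15'

# Reverse step: collapse &amp; back to & in the extracted text
>>> re.sub(r'&amp;', r'&', parsed.text.encode('utf-8'))
'example.com/a.php?b=2&c=15'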
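The "fancier regex" and the reverse function could look roughly like the sketch below. It is only one possible approach, written for Python 3 with the standard library alone. All names (NAKED_AMP, QUERY_STRING, and the three functions) are hypothetical, and "inside a URL" is approximated by "inside a query string", which is an assumption that will not fit every page.

```python
import html
import re

# An & that does NOT start a well-formed entity reference:
# named (&amp;), decimal (&#38;) or hexadecimal (&#x26;).
NAKED_AMP = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)

# Assumed heuristic for "inside a URL": a query string, i.e. a "?"
# followed by characters up to whitespace, a quote, or a tag delimiter.
QUERY_STRING = re.compile(r'\?[^\s<>"\']+')


def escape_naked_ampersands(text):
    """Replace each naked & with &amp;, leaving existing entities alone."""
    return NAKED_AMP.sub('&amp;', text)


def escape_ampersands_in_urls(markup):
    """Apply the escaping only inside query strings found in the markup."""
    return QUERY_STRING.sub(
        lambda m: escape_naked_ampersands(m.group(0)), markup
    )


def restore_ampersands(text):
    """Reverse step: decode entities in text taken from the parse tree."""
    return html.unescape(text)


markup = "<div>R&D: example.com/a.php?b=2&c=15&amp;d=29&e=42</div>"
escaped = escape_ampersands_in_urls(markup)
print(escaped)
print(restore_ampersands("example.com/a.php?b=2&amp;c=15"))
```

Ampersands outside query strings (the "R&D" here) are deliberately left untouched; to escape them as well, call escape_naked_ampersands on the whole document instead. One known ambiguity remains: a query parameter whose name is immediately followed by a semicolon looks like an entity and will not be escaped.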

There may be other more BS-thonic approaches. You may want to help test the 4.0 beta.

smci answered Sep 19 '22 18:09