I am trying to parse a site in Python that contains links to other sites, but as plain text rather than inside "a" tags. Using BeautifulSoup I get the wrong output. Consider this code:
import BeautifulSoup
html = """<html>
<head>
<title>Test html</title>
</head>
<body>
<div>
example.com/a.php?b=2&c=15
</div>
</body>
</html>"""
parsed = BeautifulSoup.BeautifulSoup(html)
print parsed
When I run the above code I get the following output:
<html>
<head>
<title>Test html</title>
</head>
<body>
<div>
example.com/a.php?b=2&c;=15
</div>
</body>
</html>
Notice the link in the "div" and the part b=2&c;=15. It's different from the original HTML. Why is BeautifulSoup messing with the link in this way? Is it trying to automagically create HTML entities? How can I prevent this?
Apparently BS has an underdocumented issue parsing ampersands inside URLs; I found it by searching their discussion forum for 'semicolon'. According to that discussion from 2009, a naked & is strictly not valid and must be replaced by &amp;, although browsers accept it, so this seems way too pedantic.
I agree this parsing behavior is bogus, and you should contact their list to ask them to at least document this better as a known issue, and fix it in future.
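For reference, here is what the escaped form that discussion asks for looks like. This is only a small sketch using the Python 3 standard library html module (my assumption; the code in this thread is Python 2 with BeautifulSoup 3), and raw_url is just an illustrative name:

```python
import html

# The plain-text link from the question, containing a bare ampersand.
raw_url = "example.com/a.php?b=2&c=15"

# html.escape turns a bare "&" into "&amp;" (it also escapes <, > and quotes).
escaped = html.escape(raw_url)
print(escaped)

# html.unescape reverses the escaping and gives back the original text.
restored = html.unescape(escaped)
print(restored)
```

Note that html.escape escapes every ampersand it sees, so it is meant for raw text, not for markup that already contains entities.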
Workaround: your workaround will most likely be re.sub(...) to find bare & characters and expand them to &amp; before parsing, ideally only inside URLs. You may also need a reverse function to collapse them back to & in the output. You'll need a fancier regex to match only ampersands inside URLs, but anyway:
import re
import BeautifulSoup
# Minimal string to tickle this
#html = "<html>example.com/a.php?b=2&c=15&d=42</html>"
html = "<html>example.com/a.php?b=2&c=15&d=29&e=42</html>"
# Expand every & that is not already the start of &amp;
html = re.sub(r'&(?!amp;)', r'&amp;', html)
parsed = BeautifulSoup.BeautifulSoup(html)
>>> print parsed.text.encode('utf-8')
'example.com/a.php?b=2&amp;c=15'
>>> re.sub(r'&amp;', r'&', parsed.text.encode('utf-8'))
'example.com/a.php?b=2&c=15'
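As for the fancier regex, here is a rough sketch of one way to touch only ampersands inside URL-like tokens. It is written for Python 3 with just the re module, and everything in it is my own assumption rather than anything from BeautifulSoup: the URL_WITH_QUERY and BARE_AMP patterns and both function names are hypothetical, and the URL pattern is deliberately crude (a host-like token, a path, then a query string).

```python
import re

# Hypothetical, deliberately crude pattern: optional scheme, a host-like token,
# a path, and a query string. Real-world URLs need something more careful.
URL_WITH_QUERY = re.compile(
    r'(?:https?://)?[\w.-]+\.[A-Za-z]{2,}/[^\s<>"\']*\?[^\s<>"\']*'
)

# An ampersand that does not already start an entity such as &amp; or &#38;
BARE_AMP = re.compile(
    r'&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)'
)


def escape_url_ampersands(markup):
    """Replace bare '&' with '&amp;', but only inside URL-like tokens."""
    return URL_WITH_QUERY.sub(
        lambda m: BARE_AMP.sub('&amp;', m.group(0)), markup
    )


def unescape_url_ampersands(text):
    """Reverse step: collapse '&amp;' back to '&' inside URL-like tokens."""
    return URL_WITH_QUERY.sub(
        lambda m: m.group(0).replace('&amp;', '&'), text
    )


markup = "<div>example.com/a.php?b=2&c=15&amp;d=4 AT&T</div>"
escaped = escape_url_ampersands(markup)
print(escaped)
print(unescape_url_ampersands("example.com/a.php?b=2&amp;c=15"))
```

You would run escape_url_ampersands on the markup before handing it to the parser, and unescape_url_ampersands on the extracted text afterwards. Ampersands outside URL-like tokens (the AT&T above) and ones that are already escaped are left alone.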
There may be other more BS-thonic approaches. You may want to help test the 4.0 beta.