I need to be able to modify every single link in an HTML document. I know that I need to use the SoupStrainer
but I'm not 100% positive on how to implement it. If someone could direct me to a good resource or provide a code example, it'd be very much appreciated.
Thanks.
It is not a real HTML parser but uses regular expressions to dive through tag soup. It is therefore more forgiving in some cases and less good in others. It is not uncommon that lxml/libxml2 parses and fixes broken HTML better, but BeautifulSoup has superiour support for encoding detection.
The HTML content of the webpages can be parsed and scraped with Beautiful Soup.
Maybe something like this would work? (I don't have a Python interpreter in front of me, unfortunately)
from bs4 import BeautifulSoup soup = BeautifulSoup('<p>Blah blah blah <a href="http://google.com">Google</a></p>') for a in soup.findAll('a'): a['href'] = a['href'].replace("google", "mysite") result = str(soup)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With