I am using beautiful soup and I am writing a crawler and have the following code in it:
print soup.originalEncoding #self.addtoindex(page, soup) links=soup('a') for link in links: if('href' in dict(link.attrs)): link['href'].replace('..', '') url=urljoin(page, link['href']) if url.find("'") != -1: continue url = url.split('?')[0] url = url.split('#')[0] if url[0:4] == 'http': newpages.add(url) pages = newpages
The link['href'].replace('..', '')
is supposed to fix links that come out as ../contact/orderform.aspx, ../contact/requestconsult.aspx, etc. However, it is not working. Links still have the leading ".." Is there something I am missing?
1 Answer. You are facing this issue because you are using the replace method incorrectly. When you call the replace method on a string in python you get a new string with the contents replaced as specified in the method call. You are not storing the modified string but are just using the unmodified string.
The Python replace() method is used to find and replace characters in a string. It requires a substring to be passed as an argument; the function finds and replaces it. The replace() method is commonly used in data cleaning.
string.replace() returns the string with the replaced values. It doesn't modify the original so do something like this:
link['href'] = link['href'].replace("..", "")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With