I'm trying to do some simple string manipulation with the href attribute of a hyperlink extracted using Beautiful Soup:
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup('<a href="http://www.some-site.com/">Some Hyperlink</a>')
href = soup.find("a")["href"]
print href
print href[href.indexOf('/'):]
All I get is:
Traceback (most recent call last):
File "test.py", line 5, in <module>
print href[href.indexOf('/'):]
AttributeError: 'unicode' object has no attribute 'indexOf'
How should I convert whatever href is into a normal string?
To use Beautiful Soup, you need to install it first: $ pip install beautifulsoup4 . Beautiful Soup also relies on a parser. It can fall back on Python's built-in html.parser, but it prefers lxml when lxml is installed. You may already have lxml, but you should check (open IDLE and try to import lxml). If it is missing, run: $ pip install lxml or $ apt-get install python-lxml .
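The import check described above can also be done from a script instead of IDLE. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical, not part of Beautiful Soup or lxml:

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found, without importing it."""
    # find_spec returns None when no module with that name is available.
    return importlib.util.find_spec(module_name) is not None


# "lxml" and "bs4" are the import names of the packages discussed above.
for name in ("lxml", "bs4"):
    status = "available" if is_installed(name) else "missing"
    print(name, status)
```

If a name is reported as missing, install it with the pip command given above.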
Beautiful Soup is a Python package for parsing HTML and XML documents, including documents with malformed markup such as unclosed tags (the name comes from "tag soup"). It builds a parse tree for each parsed page, which can be used to extract data from HTML, making it useful for web scraping.
Beautiful Soup offers many methods for searching a parse tree. The two most commonly used are find() and find_all(). Before talking about find() and find_all(), let us see some examples of the different filters you can pass into these methods.
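Python strings do not have an indexOf method. href is already a string (a unicode object, as the traceback shows), so no conversion is needed.
Use href.index('/')
href.find('/') is similar. But find returns -1 if the substring is not found, while index raises a ValueError.
So the correct thing is to use index: a -1 from find would be silently accepted as an index (since '...'[-1] will return the last character of the string), hiding the fact that nothing was found.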
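The difference between index and find can be sketched as follows. This is an illustration in Python 3 syntax (the question uses Python 2 print statements), with a plain string standing in for the href value that Beautiful Soup returns:

```python
# Plain string standing in for the href value from the question.
href = "http://www.some-site.com/"

# str.index returns the position of the first match.
first_slash = href.index("/")
print(href[first_slash:])  # //www.some-site.com/

# str.find returns -1 on a miss, so slicing with the result
# silently yields the last character instead of failing.
miss = href.find("?")
print(href[miss:])  # /

# str.index raises ValueError on a miss, which makes the problem visible.
try:
    href.index("?")
except ValueError:
    index_raised = True
else:
    index_raised = False
print(index_raised)  # True
```

If a missing substring is an expected case rather than an error, find is still usable as long as the result is compared against -1 before slicing.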