I'm trying to parse a website and get some info with the find_all()
method, but it doesn't find them all.
This is the code:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())
manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)
for manga in manga_img:
    print(manga['href'])
It only prints half of them...
Beautiful Soup provides the find() and find_all() methods for extracting specific data from an HTML document by tag. find() returns the first element matching the given criteria and stops searching there. find_all() scans the entire document and returns a list of every matching element.
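To make the difference concrete, here is a minimal sketch against a small inline HTML snippet (the links and class name mirror the ones in the question; the snippet itself is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a class="manga_img" href="/manga/one">One</a></li>
  <li><a class="manga_img" href="/manga/two">Two</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() stops at the first match and returns a single Tag (or None)
first = soup.find('a', class_='manga_img')
print(first['href'])  # /manga/one

# find_all() scans the whole document and returns a list of every match
links = soup.find_all('a', class_='manga_img')
print(len(links))  # 2
```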
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml
parser is not dealing very well with it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser
has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
Translating that to your specific code sample using urllib
, you would specify the parser thus:
soup = BeautifulSoup(page, 'html.parser') # BeatifulSoup can do the reading