
Beautiful Soup findAll doesn't find them all

I'm trying to parse a website and get some info with the find_all() method, but it doesn't find them all.

This is the code:

#!/usr/bin/python3

from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())

manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)

for manga in manga_img:
    print(manga['href'])

It only prints half of them...

Clepto asked May 01 '13 17:05


People also ask

What is find_all in BeautifulSoup?

Beautiful Soup provides the find() and find_all() methods for pulling specific data out of a parsed HTML document by tag name and optional attribute filters. find() returns the first matching element; find_all() returns all matching elements.

What is the difference between find and find_all in BeautifulSoup?

find() returns only the first match, or None if nothing matches. find_all() scans the entire document and returns a list of every match, which is empty if nothing matches.
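As a minimal sketch of that difference, the snippet below runs both methods against a small inline document. The markup and the class name `no_such_class` are made up for illustration; only the `manga_img` class comes from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<ul>
  <li><a class="manga_img" href="/manga/one/">One</a></li>
  <li><a class="manga_img" href="/manga/two/">Two</a></li>
  <li><a class="other" href="/about/">About</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match and returns a single Tag (or None).
first = soup.find("a", class_="manga_img")

# find_all() walks the whole document and returns a list of Tags.
all_links = soup.find_all("a", class_="manga_img")

# When nothing matches, find() gives None and find_all() gives an empty list.
missing = soup.find("a", class_="no_such_class")
none_found = soup.find_all("a", class_="no_such_class")

print(first["href"])
print([a["href"] for a in all_links])
print(missing, none_found)
```

Because find() can return None, it is worth checking the result before indexing into it, whereas a find_all() result can always be looped over safely.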


1 Answer

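Different HTML parsers deal differently with broken HTML. If you don't name a parser, Beautiful Soup picks the best one it finds installed, which is lxml when that is available. That page serves broken HTML, and the lxml parser is not dealing very well with it: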

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18

The standard library html.parser has less trouble with this specific page:

>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44

Translating that to your specific code sample using urllib, you would specify the parser thus:

soup = BeautifulSoup(page, 'html.parser')  # BeautifulSoup can do the reading
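If you want to check how the parser choice affects a page without hitting the network, something like the following rough sketch may help. The helper `count_links` and the malformed sample markup are hypothetical, written for this illustration; lxml and html5lib are optional packages, so the sketch skips any parser that is not installed. On a snippet this small the parsers may well agree, since the differences tend to show up on larger, messier pages like the one in the question.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, deliberately malformed markup: the <p> is never closed
# and the </div> arrives while the <p> is still open.
BROKEN_HTML = """
<div><a class="manga_img" href="/a/">A</a>
<p><a class="manga_img" href="/b/">B</a></div>
<a class="manga_img" href="/c/">C</a>
"""


def count_links(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number of matching links} for each installed parser."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The requested parser is not installed; skip it.
            continue
        counts[name] = len(soup.find_all("a", class_="manga_img"))
    return counts


counts = count_links(BROKEN_HTML)
for name, n in counts.items():
    print(f"{name}: {n} links")
```

Passing the parser name explicitly, as above, also keeps the result from changing when the same script runs on a machine with a different set of parsers installed.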
Martijn Pieters answered Sep 26 '22 15:09