I'm trying to parse a website and get some info with the find_all()
method, but it doesn't find them all.
This is the code:
#!/usr/bin/python3
from bs4 import BeautifulSoup
from urllib.request import urlopen

page = urlopen("http://mangafox.me/directory/")
# print(page.read())
soup = BeautifulSoup(page.read())
manga_img = soup.findAll('a', {'class': 'manga_img'}, limit=None)
for manga in manga_img:
    print(manga['href'])
It only prints half of them...
Beautiful Soup provides the find() and find_all() methods for extracting specific data from an HTML document by tag. find() returns the first element matching the given criteria and stops searching there. find_all() scans the entire document and returns a list of every matching element.
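To make the difference concrete, here is a minimal sketch against a small inline HTML snippet (the links and class name mirror the ones in the question; the snippet itself is made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a class="manga_img" href="/manga/one">One</a></li>
  <li><a class="manga_img" href="/manga/two">Two</a></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() stops at the first match and returns a single Tag (or None)
first = soup.find('a', class_='manga_img')
print(first['href'])  # /manga/one

# find_all() scans the whole document and returns a list of every match
links = soup.find_all('a', class_='manga_img')
print(len(links))  # 2
```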
Different HTML parsers deal differently with broken HTML. That page serves broken HTML, and the lxml
parser is not dealing very well with it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://mangafox.me/directory/')
>>> soup = BeautifulSoup(r.content, 'lxml')
>>> len(soup.find_all('a', class_='manga_img'))
18
The standard library html.parser
has less trouble with this specific page:
>>> soup = BeautifulSoup(r.content, 'html.parser')
>>> len(soup.find_all('a', class_='manga_img'))
44
Translating that to your specific code sample using urllib
, you would specify the parser thus:
soup = BeautifulSoup(page, 'html.parser') # BeatifulSoup can do the reading