I've rechecked my code and looked at comparable examples of opening a URL and passing the web data into Beautiful Soup, but for some reason my code just doesn't return anything, even though it appears to be in the correct form:
>>> from bs4 import BeautifulSoup
>>> from urllib3 import poolmanager
>>> connectBuilder = poolmanager.PoolManager()
>>> content = connectBuilder.urlopen('GET', 'http://www.crummy.com/software/BeautifulSoup/')
>>> content
<urllib3.response.HTTPResponse object at 0x00000000032EC390>
>>> soup = BeautifulSoup(content)
>>> soup.title
>>> soup.title.name
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'name'
>>> soup.p
>>> soup.get_text()
''
>>> content.data
a stream of data follows...
As shown, urlopen() returns an HTTP response, which is captured by the variable content. It makes sense that I can read the status of the response, but after it's passed into Beautiful Soup, the web data doesn't get converted into a Beautiful Soup object (the variable soup). You can see that I've tried to read a few tags and some text; get_text() returns an empty string, which is strange.
Strangely, when I access the web data via content.data, the data shows up, but it's not useful, since I can't use Beautiful Soup to parse it. What is my problem? Thanks.
If you just want to scrape the page, requests will get the content you need:
from bs4 import BeautifulSoup
import requests
r = requests.get('http://www.crummy.com/software/BeautifulSoup/')
soup = BeautifulSoup(r.content, 'html.parser')  # specify a parser to avoid the "no parser" warning
In [59]: soup.title
Out[59]: <title>Beautiful Soup: We called him Tortoise because he taught us.</title>
In [60]: soup.title.name
Out[60]: 'title'
urllib3 returns a Response object, whose .data attribute holds the preloaded body payload.
Per the top quickstart usage example here, I would do something like this:
import urllib3
http = urllib3.PoolManager()
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/')
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.data, 'html.parser')  # Note the use of the .data property
...
The rest should work as intended.
--
A little about what went wrong in your original code:
You passed the entire response object rather than the body payload. This would normally be fine, because the response object is a file-like object, except that in this case urllib3 has already consumed the entire response and parsed it for you, so there is nothing left to .read(). It's like passing a file pointer which has already been read. .data, on the other hand, will access the already-read data.
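The read-once behaviour is the same as with any file-like object; here is a minimal illustration using an in-memory buffer standing in for the response body:

```python
import io

# A file-like object standing in for the already-consumed HTTP response body.
buf = io.BytesIO(b"<html><title>Hello</title></html>")

first = buf.read()   # consumes the entire stream
second = buf.read()  # the stream is exhausted, so nothing is left

print(first)   # b'<html><title>Hello</title></html>'
print(second)  # b''
```

Passing the exhausted object to Beautiful Soup is what produced the empty document in the question.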
If you want to use urllib3 response objects as file-like objects, you'll need to disable content preloading, like this:
response = http.request('GET', 'http://www.crummy.com/software/BeautifulSoup/', preload_content=False)
soup = BeautifulSoup(response) # We can pass the original `response` object now.
Now it should work as you expected.
I understand that this is not very obvious behaviour, and as the author of urllib3 I apologize. :) We plan to make preload_content=False the default someday. Perhaps someday soon (I opened an issue here).
--
A quick note on .urlopen vs .request:
.urlopen assumes that you will take care of encoding any parameters passed to the request. In this case it's fine to use .urlopen because you're not passing any parameters, but in general .request will do all the extra work for you, so it's more convenient.
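To make the difference concrete, here is a sketch (the search URL and parameters are made up for illustration): with .urlopen you encode the query string yourself, e.g. with urllib.parse.urlencode, while .request accepts a fields dict and encodes it for you.

```python
from urllib.parse import urlencode

import urllib3

http = urllib3.PoolManager()
params = {'q': 'beautiful soup'}

# With .urlopen, you build the encoded URL yourself:
url = 'http://www.crummy.com/search?' + urlencode(params)
# response = http.urlopen('GET', url)

# With .request, urllib3 encodes the parameters for you:
# response = http.request('GET', 'http://www.crummy.com/search', fields=params)

print(url)  # http://www.crummy.com/search?q=beautiful+soup
```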
If anyone would be up for improving our documentation to this effect, that would be greatly appreciated. :) Please send a PR to https://github.com/shazow/urllib3 and add yourself as a contributor!