I'm trying to match links whose href contains certain text. I'm doing:
links = soup.find_all('a',href=lambda x: ".org" in x)
But that throws a TypeError: argument of type 'NoneType' is not iterable.
The correct way of doing it is apparently
links = soup.find_all('a',href=lambda x: x and ".org" in x)
Why is the additional x and necessary here?
There's a simple reason: one of the <a> tags in your HTML has no href attribute.
Here's a minimal example that reproduces the exception:
from bs4 import BeautifulSoup
html = '<html><body><a>bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=lambda x: ".org" in x)
# result:
# TypeError: argument of type 'NoneType' is not iterable
Now if we add an href attribute, the exception disappears:
html = '<html><body><a href="foo.org">bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a', href=lambda x: ".org" in x)
# result:
# [<a href="foo.org">bar</a>]
What's happening is that BeautifulSoup looks up the <a> tag's href attribute and passes the result to your lambda, and that lookup returns None when the attribute doesn't exist:
html = '<html><body><a>bar</a></body></html>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.a.get('href'))
# output: None
This is why your lambda has to handle None values. Since None is falsy, x and ... stops evaluating as soon as x is None, so the right-hand side of the and expression is never executed, as you can see here:
>>> None and 1/0
>>> 'foo.org' and 1/0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
This is called short-circuiting.
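To see the guard in isolation, here is a small sketch that runs both predicates over a hand-written list of values instead of a parsed document. The list hrefs and the function names unguarded and guarded are made up for illustration; None stands in for an <a> tag that has no href.

```python
# Hypothetical href values; None stands in for an <a> tag without an href.
hrefs = [None, "foo.org", "bar.com"]

def unguarded(x):
    # Raises TypeError when x is None, because `in` needs a container.
    return ".org" in x

def guarded(x):
    # Short-circuits: the `in` test is skipped when x is None.
    return x and ".org" in x

matches = [h for h in hrefs if guarded(h)]
print(matches)  # ['foo.org']

error = None
try:
    [h for h in hrefs if unguarded(h)]
except TypeError as exc:
    error = exc
    print("TypeError:", exc)
```

The guarded version never reaches the in test for None, so the list comprehension completes; the unguarded version fails on the first element.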
That said, x and ... checks the truthiness of x, and None is not the only value that's considered falsy. So it would be more correct to compare x to None like so:
lambda x: x is not None and ".org" in x
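For the ".org" in x test the two forms select the same links, since an empty string can't contain ".org" either way. The difference only shows up with a predicate that an empty string should pass. The sketch below uses a made-up predicate ("#" not in x) and hypothetical names truthy_check and none_check to show this:

```python
# Hypothetical predicates: accept any href that contains no "#".
truthy_check = lambda x: x and "#" not in x
none_check = lambda x: x is not None and "#" not in x

# An empty href ("") is falsy, so the truthiness guard rejects it...
print(repr(truthy_check("")))    # ''
# ...while the explicit None comparison lets the real test decide.
print(repr(none_check("")))      # True

# Both forms reject a missing attribute (None) without raising.
print(repr(truthy_check(None)))  # None
print(repr(none_check(None)))    # False
```

Note that the truthiness form returns x itself ('' or None) when it short-circuits, whereas the explicit comparison always returns a bool.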