I am having a hard time understanding this code. I would like to extract HTML comments using BeautifulSoup and Python3.
Given:
from bs4 import BeautifulSoup, Comment
html = '''
<!-- Python is awesome -->
<!-- Lambda is confusing -->
<title>I don't grok it</title>
'''
soup = BeautifulSoup(html, 'html.parser')
I searched for solutions and most people said:
comments = soup.find_all(text= lambda text: isinstance(text, Comment))
Which in my case would result in:
[' Python is awesome ', ' Lambda is confusing ']
This is what I understand:
- isinstance asks if text is an instance of Comment and returns a boolean.
- lambda takes text as an argument and evaluates the isinstance expression.
- find_all is given that lambda as its text= argument.
This is what I do not understand:
- What is text in text=?
- What is text in lambda text?
- What argument from html is passed into lambda text?
- soup.text returns I don't grok it. Why is lambda text passing <!-- Python is awesome --> as an argument?

.find_all() goes through each string in the parsed document and tries to match it against text='<our_text>'. Instead of an actual string (like in the example below), '<our_text>' is here a lambda function that expresses a condition.
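To see for yourself what find_all hands to the lambda, here is a small sketch (not Beautiful Soup's internals) that swaps the lambda for a named function which records every value it receives. The names seen and record_and_test are made up for this illustration. Note that newer Beautiful Soup releases document this argument as string=, with text= kept as the older name.

```python
from bs4 import BeautifulSoup, Comment

html = '''
<!-- Python is awesome -->
<!-- Lambda is confusing -->
<title>I don't grok it</title>
'''
soup = BeautifulSoup(html, 'html.parser')

seen = []  # hypothetical list, only used to record what the filter receives


def record_and_test(text):
    # find_all calls this once per candidate string and passes that string in
    seen.append((type(text).__name__, str(text)))
    # Same condition as the lambda in the question
    return isinstance(text, Comment)


comments = soup.find_all(text=record_and_test)

for type_name, value in seen:
    print(type_name, repr(value))
print(comments)
```

Running it should show that the comments and the title's string all reach the function as string objects, and that only the Comment ones make it return True.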
I'll explain each part of this question.
text=
from bs4 import BeautifulSoup
html = '''
<!--Python is awesome-->
<!--Lambda is confusing-->
<title>I don't grok it</title>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all(text='Python is awesome'))
Output:
['Python is awesome']
Here text= is just a keyword argument of find_all. We can pass it a string, a regex, or a function (directly or through a variable). It just happened to be a lambda in our case. Next, we'll explain what the lambda does.
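As a rough side-by-side, this sketch passes a few different kinds of filter to text= on the same document. The variable names (exact, by_regex, by_list, by_func) are only for illustration, and the list form is an extra that the text above does not mention.

```python
import re

from bs4 import BeautifulSoup, Comment

html = '''
<!--Python is awesome-->
<!--Lambda is confusing-->
<title>I don't grok it</title>
'''
soup = BeautifulSoup(html, 'html.parser')

# A plain string must match the whole string exactly
exact = soup.find_all(text='Python is awesome')

# A compiled regex matches any string it can find a hit in
by_regex = soup.find_all(text=re.compile('confusing'))

# A list matches any string equal to one of its items
by_list = soup.find_all(text=['Python is awesome', "I don't grok it"])

# A function matches every string for which it returns True
by_func = soup.find_all(text=lambda text: isinstance(text, Comment))

print(exact)
print(by_regex)
print(by_list)
print(by_func)
```

The lambda is just the most flexible of these options: it can test anything about the string, including its type.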
Lambda
This lambda function takes a variable named text as input.
.find_all automatically feeds each string it finds into the lambda function, one at a time:
lambda text: isinstance(text, Comment)
isinstance checks whether its first argument, text, is a Comment, and returns either True or False. For example, if some_var = 'Ey man', then isinstance(some_var, str) returns True, because it is a string (str).
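If the lambda syntax is the confusing part, it may help to see it next to an ordinary def. This is a minimal sketch: is_comment_lambda and is_comment_def are made-up names, and the Comment object is built by hand here only so that there is something to test (normally the parser creates it).

```python
from bs4 import Comment

# The lambda from the question, given a name so we can call it
is_comment_lambda = lambda text: isinstance(text, Comment)


# The same function written with def
def is_comment_def(text):
    return isinstance(text, Comment)


some_var = 'Ey man'
hand_made_comment = Comment(' Python is awesome ')  # built by hand for illustration

print(isinstance(some_var, str))             # a plain str
print(is_comment_lambda(some_var))           # a plain str is not a Comment
print(is_comment_lambda(hand_made_comment))  # a Comment is a Comment
print(isinstance(hand_made_comment, str))    # Comment is also a kind of str
```

Both functions behave the same; the lambda is simply a way to write the function inline where find_all expects it.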
Next, we combine both of these.
soup.find_all(text= lambda text: isinstance(text, Comment))
1. soup.find_all goes through each piece of the document: <!--Python is awesome.., <!--Lambda.., <title>I..
2. We have a condition within .find_all(<the_condition>) and keep the pieces that fulfill that condition.
3. The condition in our case is:
3.1. First, we don't check everything, only the strings: plain text, the text inside tags, comments, and whatever other string there is. That's text=.
3.2. The text also has a condition: not just any text is taken, only text for which the lambda function returns True, i.e. text that fulfills the lambda's condition.
3.3. The lambda's condition is that the text has to be an instance of Comment, meaning it returns True only for a Comment.
Only if all these conditions are met do we take that string and store it.
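The steps above can be spelled out as an explicit loop. This is only an approximation of what find_all does for this call, written with soup.descendants, and not the library's actual implementation; manual and via_find_all are illustrative names.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = '''
<!--Python is awesome-->
<!--Lambda is confusing-->
<title>I don't grok it</title>
'''
soup = BeautifulSoup(html, 'html.parser')

manual = []
for node in soup.descendants:           # steps 1 and 2: walk every piece of the document
    if not isinstance(node, NavigableString):
        continue                        # step 3.1: text= only looks at strings, not tags
    if isinstance(node, Comment):       # steps 3.2 and 3.3: the lambda's condition
        manual.append(node)             # all conditions met, so keep it

via_find_all = soup.find_all(text=lambda text: isinstance(text, Comment))

print(manual)
print(via_find_all)
```

For this document both lists should hold the same two comments.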
What is text in text=?
A keyword argument to the find_all function.
What is text in lambda text?
The parameter of the function, the same as in
def <name>(text)...
What argument from html is passed into lambda text?
That is up to you: in the sample, the variable html holds the text to parse, and find_all passes each string it finds there to the lambda.
soup.text returns I don't grok it. Why is lambda text passing <!-- Python is awesome --> as an argument?
That's just an example, to be replaced with real HTML.
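On the soup.text part of the question, a quick check like the one below may help. It is a sketch built on the question's own sample: soup.text leaves comments out, while the comments are still in the parse tree, which is why find_all can hand them to the lambda. The names page_text and comments are only for illustration.

```python
from bs4 import BeautifulSoup, Comment

html = '''
<!-- Python is awesome -->
<!-- Lambda is confusing -->
<title>I don't grok it</title>
'''
soup = BeautifulSoup(html, 'html.parser')

# soup.text collects the ordinary text of the document and skips comments
page_text = soup.text

# The comments are still nodes in the tree, so find_all can reach them
comments = soup.find_all(text=lambda text: isinstance(text, Comment))

print(repr(page_text))
print(comments)
```

So soup.text and the value passed into lambda text are two different things: the first is a joined-up view of the visible text, the second is one individual string from the tree at a time.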