
BeautifulSoup and Python Lambda

I am having a hard time understanding this code.

I would like to extract HTML comments using BeautifulSoup and Python3.

Given:

from bs4 import BeautifulSoup, Comment

html = '''
       <!-- Python is awesome -->
       <!-- Lambda is confusing -->
       <title>I don't grok it</title>
       '''

soup = BeautifulSoup(html, 'html.parser')

I searched for solutions and most people said:

comments = soup.find_all(text=lambda text: isinstance(text, Comment))

Which in my case would result in:

[' Python is awesome ', ' Lambda is confusing ']

This is what I understand:

  • isinstance asks if text is an instance of Comment and returns a boolean.
  • I sort of understand lambda: it takes text as an argument and evaluates the isinstance expression.
  • You can pass a function to find_all.

This is what I do not understand:

  • What is text in text=?
  • What is text in lambda text?
  • What argument from html is passed into lambda text?
  • soup.text returns "I don't grok it". Why is lambda text passing <!-- Python is awesome --> as an argument?
asked Apr 24 '18 at 13:04 by tomordonez


2 Answers

Summary

.find_all() walks the parsed document and, when given text=<our_text>, tests each string it finds (plain text or comment) against <our_text>. Instead of an actual string (as in the example below), <our_text> here is a lambda function that expresses a condition.

I'll explain each part of this question.

text=

from bs4 import BeautifulSoup

html = '''
       <!--Python is awesome-->
       <!--Lambda is confusing-->
       <title>I don't grok it</title>
       '''

soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all(text='Python is awesome'))

Output:

['Python is awesome']

Here text= is just a keyword argument of find_all. It accepts a string, a compiled regex, or a function. It just happens to be a lambda in our case. Next, let's look at what the lambda does.
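As a rough illustration of the filter kinds mentioned above, here is a minimal sketch that passes a string, a compiled regex, and a function to find_all. It reuses the sample HTML from the question; the helper name is_comment is made up for this example. Beautiful Soup 4.4.0 and later also accept string= as the newer name for text=, so the last call assumes such a release.

```python
import re

from bs4 import BeautifulSoup, Comment

html = '''
       <!-- Python is awesome -->
       <!-- Lambda is confusing -->
       <title>I don't grok it</title>
       '''
soup = BeautifulSoup(html, 'html.parser')

# 1. A plain string: keeps the strings that are equal to it.
exact = soup.find_all(text="I don't grok it")

# 2. A compiled regex: keeps the strings in which the pattern is found.
by_regex = soup.find_all(text=re.compile('awesome'))


# 3. A function (hypothetical helper): keeps the strings for which
#    it returns True.
def is_comment(text):
    return isinstance(text, Comment)


by_function = soup.find_all(text=is_comment)

# Same filter under the newer keyword name (assumes bs4 >= 4.4.0).
by_function_new_name = soup.find_all(string=is_comment)

print(exact)
print(by_regex)
print(by_function)
```

The lambda from the question is simply an inline version of is_comment.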

Lambda

This lambda function takes one argument, which we named text.

We never call it ourselves. .find_all calls it once for each string in the parsed document and passes that string in as text.

lambda text: isinstance(text, Comment) 

isinstance checks whether its first argument, text, is a Comment and returns either True or False. Example: with some_var = 'Ey man', isinstance(some_var, str) -> True, because it's a string (str).
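For readers who find the lambda syntax opaque, this small sketch shows that a lambda is just an unnamed one-expression function, and that isinstance also returns True for subclasses. That second point matters here because Comment is a subclass of Beautiful Soup's string type. The names is_str, is_str_def and Shout are invented for the example.

```python
# A lambda is an anonymous function whose body is a single expression.
is_str = lambda value: isinstance(value, str)


# The same check written as a regular named function.
def is_str_def(value):
    return isinstance(value, str)


some_var = 'Ey man'
print(is_str(some_var))      # True
print(is_str_def(some_var))  # True
print(is_str(42))            # False


# isinstance also accepts instances of subclasses.
class Shout(str):  # hypothetical subclass, for illustration only
    pass


print(isinstance(Shout('hey'), str))  # True: a Shout is also a str
print(isinstance('hey', Shout))       # False: a plain str is not a Shout
```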

Next, we combine both of these.

soup.find_all(text=lambda text: isinstance(text, Comment))

  1. soup.find_all walks the parsed document node by node: the comment <!--Python is awesome-->, the comment <!--Lambda is confusing-->, the <title> tag and the string inside it.

  2. The argument passed to .find_all(<the_condition>) is a condition, and only the nodes that fulfill it are kept.

  3. The condition in our case has three parts:

    3.1. text= restricts the search to strings: plain text inside tags, whitespace between tags, and comments. Tags themselves are not returned.

    3.2. Not every string is kept. A string is kept only if the lambda returns True for it.

    3.3. The lambda returns True only when the string is an instance of Comment.

Only if all of these conditions are met is that string added to the result list.
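To make the steps above concrete, the sketch below does by hand roughly what the find_all call does: it collects every string node, then keeps the ones that are comments. It mirrors the observable result only, not Beautiful Soup's internal implementation, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

html = '''
       <!-- Python is awesome -->
       <!-- Lambda is confusing -->
       <title>I don't grok it</title>
       '''
soup = BeautifulSoup(html, 'html.parser')

# Steps 1 and 3.1: walk the whole tree and keep only the string nodes.
string_nodes = [node for node in soup.descendants
                if isinstance(node, NavigableString)]

# Show what kind of object each string node is.
for node in string_nodes:
    print(type(node).__name__, repr(str(node)))

# Steps 3.2 and 3.3: keep a string only if it is a Comment.
manual = [node for node in string_nodes if isinstance(node, Comment)]

# The one-liner from the question gives the same list.
via_find_all = soup.find_all(text=lambda text: isinstance(text, Comment))

print(manual == via_find_all)
```

The printed types show that the whitespace and the title text are NavigableString objects while the two comments are Comment objects, which is why only the comments pass the isinstance test.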

answered Nov 18 '22 at 21:11 by innicoder


What is text in text=?

A keyword argument to the find_all function.

What is text in lambda text?

The parameter of the lambda function, the same as text in

def <name>(text): ...

What argument from html is passed into lambda text

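That is not up to you: find_all calls the function once for each string in the parsed document. In the sample, Comment is the bs4 class that represents an HTML comment, not a variable holding the text to parse.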

soup.text returns "I don't grok it". Why is lambda text passing <!-- Python is awesome --> as an argument?

That's just example HTML, to be replaced with real HTML.
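On the soup.text part of the question, one way to see the difference is to compare what soup.get_text() (the method behind soup.text) and find_all return for the sample HTML. By default get_text() leaves comment nodes out, while a text= filter is tested against comments as well. A minimal sketch, reusing the HTML from the question:

```python
from bs4 import BeautifulSoup, Comment

html = '''
       <!-- Python is awesome -->
       <!-- Lambda is confusing -->
       <title>I don't grok it</title>
       '''
soup = BeautifulSoup(html, 'html.parser')

# get_text() joins the ordinary strings and skips the comments.
page_text = soup.get_text()
print(repr(page_text.strip()))

# find_all with a text= filter also visits the comment nodes,
# so the lambda gets a chance to accept them.
comments = soup.find_all(text=lambda text: isinstance(text, Comment))
print(comments)
```

So the lambda is not fed the value of soup.text. It receives each string node of the document in turn, comments included.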

answered Nov 18 '22 at 20:11 by user1443098