<p>I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use <code>re.findall()</code>.</p> <p>For example, when I want to find all ... parts in the file, I might write something like this:</p> <pre class="prettyprint"><code>re.findall("<div>.*\</div>", result_page) </code></pre> <p>If <code>result_page</code> is the string <code>"<div> </div> <div> </div>"</code>, the result is</p> <pre class="prettyprint"><code>['<div> </div> <div> </div>'] </code></pre> <p>That is a single match covering the entire string. This is not what I want; I expect the two divs to be returned separately. What should I do?</p>

<p>Quoting the documentation,</p> <blockquote> <p>The <code>'*'</code>, <code>'+'</code>, and <code>'?'</code> qualifiers are all greedy; they match as much text as possible. Adding <code>'?'</code> after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.</p> </blockquote> <p>Just add a question mark after the <code>*</code>:</p> <pre class="prettyprint"><code>In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
</code></pre> <p>Also, you shouldn't use regular expressions to parse HTML, since there are HTML parsers made exactly for that. An example using BeautifulSoup 4:</p> <pre class="prettyprint"><code>In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser')('div')]
Out[8]: ['<div> </div>', '<div> </div>']
</code></pre>
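To make the difference concrete, here is a minimal sketch that runs the greedy and non-greedy patterns side by side on the string from the question. The `nested` input at the end is not from the question; it is an assumed example showing one way the non-greedy pattern can still go wrong on real HTML, which is part of why a parser is the safer choice.

```python
import re

# The sample string from the question.
result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the last </div> in the string, so there is a single match.
greedy = re.findall(r"<div>.*</div>", result_page)

# Non-greedy: .*? stops at the first </div> it can reach, so each div matches on its own.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']

# Assumed extra input (not from the question): nested divs.
# The non-greedy match ends at the inner </div>, so the outer div is cut short.
nested = "<div><div>a</div></div>"
nested_matches = re.findall(r"<div>.*?</div>", nested)
print(nested_matches)  # ['<div><div>a</div>']
```

Note that `.` does not match newlines by default, so a div spanning several lines would also need the `re.DOTALL` flag.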

python RE findall() return value is an entire string

Tags:

python

html

regex

web-crawler


224

asked Apr 26 '15 04:04

alvinzoo

1 Answer


108

answered Nov 14 '22 21:11

vaultah
