Python re.findall() returns the entire string as a single match

I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().

For example, to find every ... part in the file, I might write something like this:

re.findall(r"<div>.*</div>", result_page)

If result_page is the string "<div> </div> <div> </div>", the result is

['<div> </div> <div> </div>']

That is, a single match containing the entire string. This is not what I want: I expect the two divs as separate matches. What should I do?

alvinzoo asked Apr 26 '15 04:04




1 Answer

Quoting the documentation,

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.

Just add the question mark:

In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
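
To make the difference concrete, here is a small self-contained sketch that runs the greedy and non-greedy patterns side by side on the string from the question. The second half is an assumption that goes beyond the question: if a div spans several lines (the `multiline_page` string is made up for illustration), `.` will not match the newlines unless `re.DOTALL` is passed.

```python
import re

result_page = "<div> </div> <div> </div>"

# Greedy: .* runs to the LAST </div>, so everything is one match.
greedy = re.findall(r"<div>.*</div>", result_page)
# Non-greedy: .*? stops at the FIRST </div> it can reach.
lazy = re.findall(r"<div>.*?</div>", result_page)

print(greedy)  # ['<div> </div> <div> </div>']
print(lazy)    # ['<div> </div>', '<div> </div>']

# Hypothetical input, not from the question: a div that spans lines.
multiline_page = "<div>\nfirst\n</div>\n<div>second</div>"

# By default '.' does not match '\n', so the first div is skipped.
without_dotall = re.findall(r"<div>.*?</div>", multiline_page)
# With re.DOTALL, '.' also matches newlines.
with_dotall = re.findall(r"<div>.*?</div>", multiline_page, flags=re.DOTALL)

print(without_dotall)  # ['<div>second</div>']
print(with_dotall)     # ['<div>\nfirst\n</div>', '<div>second</div>']
```

The raw-string prefix (`r"..."`) is optional for this particular pattern, but it keeps backslashes from being interpreted by Python before the regex engine sees them.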

Also, you shouldn't use regular expressions to parse HTML, since there are HTML parsers made exactly for that. Example using BeautifulSoup 4:

In [7]: import bs4

In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser').find_all('div')]
Out[8]: ['<div> </div>', '<div> </div>']
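
One way to see why a parser is the safer choice is to try nested divs, which the question does not cover. The sketch below is only an illustration under that assumption: the `nested` string and the `DivCounter` class are made up here, and the standard library's `html.parser` stands in for BeautifulSoup so the snippet runs without extra installs.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: one div nested inside another.
nested = "<div><div>inner</div></div>"

# The non-greedy pattern stops at the first </div>, so the outer div
# is cut short and the match is not a well-formed element.
regex_matches = re.findall(r"<div>.*?</div>", nested)
print(regex_matches)  # ['<div><div>inner</div>']


class DivCounter(HTMLParser):
    """Illustrative helper: counts <div> tags and tracks nesting depth."""

    def __init__(self):
        super().__init__()
        self.count = 0      # total number of opening <div> tags seen
        self.depth = 0      # current nesting level
        self.max_depth = 0  # deepest nesting level reached

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.count += 1
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


counter = DivCounter()
counter.feed(nested)
counter.close()
print(counter.count, counter.max_depth)  # 2 2
```

The parser tracks opening and closing tags as events, so it sees both divs and their nesting, which a single regular expression cannot do reliably.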
vaultah answered Nov 14 '22 21:11
