I am writing a crawler to extract certain parts of an HTML file, but I cannot figure out how to use re.findall().
Here is an example: when I want to find every ... part in the file, I might write something like this:
re.findall("<div>.*\</div>", result_page)
If result_page is the string "<div> </div> <div> </div>", the result will be:
['<div> </div> <div> </div>']
That is only the entire string, which is not what I want: I am expecting the two divs separately. What should I do?
findall(): finding all matches in a string. The re module's findall() function returns a list of strings containing all non-overlapping matches of a pattern (or, if the pattern contains capture groups, a list of the captured groups). If the pattern is not found, re.findall() returns an empty list.
The findall() function searches for all occurrences that match a given pattern. In contrast, the search() function returns only the first occurrence that matches the pattern. findall() scans the whole string from left to right and returns all non-overlapping matches of the pattern in a single call.
A compiled pattern's .findall() method works the same way: it scans a string for substrings that match the pattern and returns a list of every match found.
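To make the difference between findall() and search() concrete, here is a minimal sketch. The sample string is the one from the question; the variable names and the simple tag pattern are chosen only for illustration.

```python
import re

text = "<div> </div> <div> </div>"

# findall returns every non-overlapping match as a list of strings
all_tags = re.findall(r"</?div>", text)

# search returns a match object for the first occurrence only (or None)
first = re.search(r"</?div>", text)

# when nothing matches, findall returns an empty list rather than None
missing = re.findall(r"<span>", text)

print(all_tags)       # ['<div>', '</div>', '<div>', '</div>']
print(first.group())  # <div>
print(missing)        # []
```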
Quoting the documentation,
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched.
Just add the question mark:
In [6]: re.findall("<div>.*?</div>", result_page)
Out[6]: ['<div> </div>', '<div> </div>']
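For a side-by-side view, the sketch below runs the greedy and non-greedy patterns on the same input. It also shows two variations that may be useful in a crawler but are not part of the original question: a capture group to get only the text between the tags, and re.DOTALL for content that spans several lines. The multiline_page string is made up for the example.

```python
import re

result_page = "<div> </div> <div> </div>"

# greedy: .* runs to the LAST </div>, so there is a single match
greedy = re.findall(r"<div>.*</div>", result_page)

# non-greedy: .*? stops at the FIRST </div> it can reach
lazy = re.findall(r"<div>.*?</div>", result_page)

# with one capture group, findall returns only the captured text
inner = re.findall(r"<div>(.*?)</div>", result_page)

# '.' does not match a newline unless re.DOTALL is passed
multiline_page = "<div>\nfirst\n</div> <div>second</div>"
across_lines = re.findall(r"<div>(.*?)</div>", multiline_page, re.DOTALL)

print(greedy)        # ['<div> </div> <div> </div>']
print(lazy)          # ['<div> </div>', '<div> </div>']
print(inner)         # [' ', ' ']
print(across_lines)  # ['\nfirst\n', 'second']
```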
Also, you shouldn't use regex to parse HTML, since there are HTML parsers made exactly for that. Example using BeautifulSoup 4:
In [7]: import bs4
In [8]: [str(tag) for tag in bs4.BeautifulSoup(result_page, 'html.parser')('div')]
Out[8]: ['<div> </div>', '<div> </div>']
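If installing BeautifulSoup is not an option, the standard library's html.parser module can do a similar job. The sketch below is only an illustration: DivCollector is a hypothetical helper class, not something from the answer above, and the nested sample input is made up. It collects the text of each top-level div, including nested divs, which is a case where the non-greedy regex would cut the match short at the first closing tag.

```python
from html.parser import HTMLParser


class DivCollector(HTMLParser):
    """Hypothetical helper: collects the text inside each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many <div> elements are currently open
        self.divs = []    # finished text, one entry per top-level <div>
        self._buf = []    # text pieces of the <div> being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self._buf = []  # a new top-level <div> starts here
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.divs.append("".join(self._buf))

    def handle_data(self, data):
        if self.depth > 0:  # ignore text outside any <div>
            self._buf.append(data)


collector = DivCollector()
collector.feed("<div>a <div>nested</div> b</div> <div>c</div>")
collector.close()
print(collector.divs)  # ['a nested b', 'c']
```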