
Beautiful Soup Using Regex to Find Tags?

I'd like Beautiful Soup to match any tag from a list of tag names, like so. I know attrs accepts a regex, but is there anything in Beautiful Soup that lets you do the same for tag names?

soup.findAll("(a|div)") 

Output:

<a> ASDFS <div> asdfasdf <a> asdfsdf 

My goal is to create a scraper that can grab tables from sites. Sometimes tags are named inconsistently, and I'd like to be able to input a list of tags to name the 'data' part of a table.

user3314418 asked Jul 15 '14 01:07



2 Answers

Note that you can also use regular expressions to search in attributes of tags. For example:

import re
from bs4 import BeautifulSoup

soup.find_all('a', {'href': re.compile(r'crummy\.com/')})

This example finds all <a> tags whose href attribute contains the substring 'crummy.com/'.
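For a version that runs on its own, here is a minimal sketch of the same attribute filter. The HTML string and the example.org URL are made up for illustration, and 'html.parser' is just the parser bundled with Python; neither comes from the answer above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example only.
html = """
<a href="https://www.crummy.com/software/BeautifulSoup/">docs</a>
<a href="https://example.org/">other</a>
<div href="https://www.crummy.com/ignored">not an anchor</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The tag name must be "a" AND the href must match the regex.
links = soup.find_all("a", {"href": re.compile(r"crummy\.com/")})
hrefs = [a["href"] for a in links]
print(hrefs)
```

The <div> is skipped even though its href matches, because the name filter and the attribute filter both have to pass.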

Manu CJ answered Oct 05 '22 08:10


find_all() is the most commonly used method in the Beautiful Soup search API.

It accepts several kinds of filters. To match multiple tags, pass a list of tag names:

>>> soup.find_all(['a', 'div'])  

Example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div>asdfasdf</div><p><a>foo</a></p></body></html>', 'html.parser')
>>> soup.find_all(['a', 'div'])
[<div>asdfasdf</div>, <a>foo</a>]
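Applied to the table-scraping goal in the question, the list form might look like the sketch below. The table markup and the DATA_TAGS name are hypothetical; the idea is only that the caller supplies whichever tag names a given site uses for its data cells.

```python
from bs4 import BeautifulSoup

# Hypothetical table that mixes <th> and <td> for its cells.
html = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>apple</td><td>3</td></tr>
  <tr><th>pear</th><td>5</td></tr>
</table>
"""

# Tag names to treat as table cells (an assumption; adjust per site).
DATA_TAGS = ["td", "th"]

soup = BeautifulSoup(html, "html.parser")

# One list of cell texts per row, matching any tag named in DATA_TAGS.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(DATA_TAGS)]
    for tr in soup.find_all("tr")
]
print(rows)
```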

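Or you can use a regular expression. Beautiful Soup applies it to tag names with search(), so the pattern below finds every tag whose name contains a or div (table and span included); anchor it as ^(a|div)$ to match only <a> and <div>: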

>>> import re
>>> soup.find_all(re.compile("(a|div)"))
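Because a regex filter is applied with search(), it matches anywhere in the tag name. A minimal sketch comparing the unanchored pattern with an anchored one follows; the HTML is made up for illustration and 'html.parser' is assumed as the parser.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with tag names that merely contain the letter "a".
html = (
    "<html><body><table><tr><td><a>foo</a></td></tr></table>"
    "<div>bar</div><span>baz</span></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Unanchored: any tag whose name contains "a" or "div".
loose = [t.name for t in soup.find_all(re.compile("(a|div)"))]

# Anchored: only tags named exactly "a" or "div".
exact = [t.name for t in soup.find_all(re.compile(r"^(a|div)$"))]

print(loose)
print(exact)
```

For a fixed set of names, the list form soup.find_all(['a', 'div']) gives the same result as the anchored pattern and is easier to read.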
hwnd answered Oct 05 '22 09:10