Python list processing to extract substrings

Question

I parsed an HTML page with BeautifulSoup, extracting all div elements with specific class names into a list.

I now have to strip the HTML markup from the items in this list, leaving behind only the string tokens I need.

The list I start with looks like this:

[<div class="info-1">
Name1a    <span class="bold">Score1a</span>
</div>, <div class="info-2">
Name1b    <span class="bold">Score1b</span>
</div>, <div class="info-1">
Name2a    <span class="bold">Score2a</span>
</div>, <div class="info-2">
Name2b    <span class="bold">Score2b</span>
</div>, <div class="info-1">
Name3a    <span class="bold">Score3a</span>
</div>, <div class="info-2">
Name3b    <span class="bold">Score3b</span>
</div>]

The whitespace is deliberate. I need to reduce that list to:

[('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a'), ('Name2b', 'Score2b'), ('Name3a', 'Score3a'), ('Name3b', 'Score3b')]

What's an efficient way to parse out substrings like this?

I've tried using the split method (e.g. [item.split('<div class="info-1"> ',1) for item in string_list]), but each split just leaves a substring that needs further splitting, which is inefficient. The same goes for replace.

I feel I ought to go the other way around and extract the tokens I need, but I can't seem to wrap my head around an elegant way to do this. Being new to this hasn't helped either. I appreciate your help.

宏杰李 · Accepted Answer

Do not convert a BeautifulSoup object to a string unless you really need to.
Use a CSS selector to find the divs whose class starts with info.
Use stripped_strings to get all the non-empty strings under a tag, with surrounding whitespace removed.
Use tuple() to convert an iterable to a tuple object.
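To see why stripped_strings does the cleanup that split and replace were being used for, here is a minimal sketch comparing it with the plain strings generator on a single div. It assumes the built-in 'html.parser' backend (so no extra install is needed), and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# One div from the question, including its deliberate whitespace
html = '<div class="info-1">\nName1a    <span class="bold">Score1a</span>\n</div>'

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first tag matching the CSS selector
div = soup.select_one('div[class^="info"]')

# .strings yields every text node as-is, whitespace included
raw_strings = list(div.strings)

# .stripped_strings trims each text node and skips whitespace-only ones
clean_strings = list(div.stripped_strings)

print(raw_strings)
print(clean_strings)
```

The second list contains only 'Name1a' and 'Score1a', so no manual splitting or replacing is required.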

import bs4

html = '''<div class="info-1">
Name1a    <span class="bold">Score1a</span>
</div>, <div class="info-2">
Name1b    <span class="bold">Score1b</span>
</div>, <div class="info-1">
Name2a    <span class="bold">Score2a</span>
</div>, <div class="info-2">
Name2b    <span class="bold">Score2b</span>
</div>, <div class="info-1">
Name3a    <span class="bold">Score3a</span>
</div>, <div class="info-2">
Name3b    <span class="bold">Score3b</span>
</div>'''

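# The 'lxml' parser is a separate package and must be installed
soup = bs4.BeautifulSoup(html, 'lxml')

# Select every div whose class attribute starts with "info"
for div in soup.select('div[class^="info"]'):
    # stripped_strings yields each non-empty text fragment, whitespace trimmed
    t = tuple(div.stripped_strings)
    print(t)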

Output:

('Name1a', 'Score1a')
('Name1b', 'Score1b')
('Name2a', 'Score2a')
('Name2b', 'Score2b')
('Name3a', 'Score3a')
('Name3b', 'Score3b')
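The question asks for a single list of tuples rather than printed lines. A minimal sketch of that variant follows, starting from a list of tags such as find_all returns. It assumes the 'html.parser' backend and a shortened sample; the names divs and pairs are illustrative, not from the original answer.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same structure as the question's data
html = """<div class="info-1">
Name1a    <span class="bold">Score1a</span>
</div><div class="info-2">
Name1b    <span class="bold">Score1b</span>
</div><div class="info-1">
Name2a    <span class="bold">Score2a</span>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# A list passed to class_ matches divs carrying any of these classes
divs = soup.find_all("div", class_=["info-1", "info-2"])

# Work on the Tag objects directly; no conversion to str is needed
pairs = [tuple(div.stripped_strings) for div in divs]

print(pairs)
# [('Name1a', 'Score1a'), ('Name1b', 'Score1b'), ('Name2a', 'Score2a')]
```

The list comprehension keeps the tags as BeautifulSoup objects throughout, which is the point of the first piece of advice in the answer.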

Tags: python, beautifulsoup

Hassan Baig
