I am trying to extract the link titles that sit between two bolded tags on an HTML page using Python/Beautiful Soup.
The HTML snippet of what I am trying to extract is as follows:
<B>Heading Title 1:</B> <a href="link1">Title1</a>
<a href="link2">Title2</a>
<B>Heading Title 2:</B> <a href="link3">Title3</a>
<a href="link4">Title4</a>
<a href="link5">Title5</a>
...
I am specifically looking to concatenate Title1 and Title2 (separated by a delimiter) into one entry in a list-like object, likewise for Title3, Title4, and Title5, and so on. (One issue I foresee is that the number of titles differs from one Heading Title to the next.)
I've tried various approaches, including:
import requests, csv
from bs4 import BeautifulSoup

res = requests.get('WEBSITE.html')
soup = BeautifulSoup(res.text, 'html.parser')
soupy4 = soup.select('a')
categories = []
with open('output.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    for line in soupy4:
        if 'common_element_link' in line['href']:
            categories.append(line.next_element)
            writer.writerow([categories])
However, while this writes all titles to a file, it does so by directly appending each additional title like so:
['Title1']
['Title1', 'Title2']
['Title1', 'Title2', 'Title3']
['Title1', 'Title2', 'Title3', 'Title4']
...
Ideally, I want this code to do the following:
['Title1', 'Title2']
['Title3', 'Title4', 'Title5']
...
I am very much a newbie in regards to python lists and programming in general and am at a loss for how to proceed. I would appreciate any and all feedback anyone may have regarding this.
Thank you!
You could use the nth-of-type and :not pseudo-classes with the general sibling combinator (~). As the a tags are all siblings (I believe) in the HTML shown, I use the b tags with nth-of-type to split the a tags between them into blocks. I use :not to remove the a siblings that follow the next b tag from the current block.
from bs4 import BeautifulSoup as bs

html = '''
<B>Heading Title 1:</B> <a href="link1">Title1</a>
<a href="link2">Title2</a>
<B>Heading Title 2:</B> <a href="link3">Title3</a>
<a href="link4">Title4</a>
<a href="link5">Title5</a>
'''
soup = bs(html, 'lxml')
# b tags that have at least one later a sibling
items = soup.select('b:has(~a)')
length = len(items)
if length == 1:
    row = [item.text for item in soup.select('b ~ a')]
    print(row)
elif length > 1:
    for i in range(1, length + 1):
        # a siblings after the i-th b, minus those that also follow the (i+1)-th b
        row = [item.text for item in soup.select('b:nth-of-type(' + str(i) + ') ~ a:not(b:nth-of-type(' + str(i + 1) + ') ~ a)')]
        print(row)
output:
['Title1', 'Title2']
['Title3', 'Title4', 'Title5']
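The question also asks for each group to become a single delimiter-separated entry. A minimal sketch of that last step, assuming the rows have already been collected into a list: the name `rows`, the `'; '` separator, and the in-memory buffer standing in for the output file are illustrative choices, not part of the original answer.

```python
import csv
import io

# Hypothetical rows, shaped like the lists printed by the loop
rows = [['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]

delimiter = '; '  # assumed separator; anything other than the CSV comma avoids quoting

# One string per heading, with the titles joined by the delimiter
entries = [delimiter.join(row) for row in rows]

# StringIO stands in for open('output.csv', 'w', newline='')
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
for entry in entries:
    writer.writerow([entry])  # a single-column row per heading

print(entries)
print(buffer.getvalue())
```

Building a fresh list per heading, rather than appending to one shared list, is what keeps earlier titles from leaking into later rows.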
You can use itertools.groupby to combine all link text between headings:
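import itertools
from bs4 import BeautifulSoup as soup

# content is the page's HTML as a string (for example, res.text in the question)
# Collect every b and a tag in document order, paired with its tag name
d = [[i.name, i] for i in soup(content, 'html.parser').find_all(['b', 'a'])]
# Group consecutive tags by whether they are b headings
new_d = [[a, list(b)] for a, b in itertools.groupby(d, key=lambda x: x[0] == 'b')]
# Keep only the non-heading groups and pull the text out of each link
final_result = [[c.text for _, c in b] for a, b in new_d if not a]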
Output:
[['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]
The original find_all call works as a "flattener" and creates a list of lists with the target tag names and content. itertools.groupby has a key that groups based on whether the tag name is for bold content. Thus, a final pass can be made over new_d, ignoring b groups, and extracting the text from the links.
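To make the grouping step easier to follow, here is a small sketch that runs the same groupby logic on plain tuples instead of parsed tags. The `pairs` list is a hypothetical stand-in for the (tag name, text) data that find_all would produce from the sample HTML, and the variable names are illustrative only.

```python
import itertools

# Hypothetical stand-in for the (tag name, text) pairs taken from the sample HTML
pairs = [
    ('b', 'Heading Title 1:'), ('a', 'Title1'), ('a', 'Title2'),
    ('b', 'Heading Title 2:'), ('a', 'Title3'), ('a', 'Title4'), ('a', 'Title5'),
]

# groupby starts a new group each time the key flips between True and False
grouped = [
    (is_heading, [text for _, text in run])
    for is_heading, run in itertools.groupby(pairs, key=lambda p: p[0] == 'b')
]

for is_heading, texts in grouped:
    print(is_heading, texts)

# Dropping the heading runs leaves one list of link texts per heading
link_groups = [texts for is_heading, texts in grouped if not is_heading]
print(link_groups)
```

One thing to keep in mind with this approach: groupby only splits on consecutive runs, so two headings with no links between them fall into the same True run and the empty heading produces no list of its own.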