
How do I use Python/Beautiful Soup to extract text between two different tags?

I am trying to extract the link titles that appear between two bold tags on an HTML page using Python/Beautiful Soup.

The HTML snippet of what I am trying to extract is as follows:

<B>Heading Title 1:</B>&nbsp;<a href="link1">Title1</a>&nbsp;
<a href="link2">Title2</a>&nbsp;

&nbsp;

<B>Heading Title 2:</B>&nbsp;<a href="link3">Title3</a>&nbsp;
<a href="link4">Title4</a>&nbsp;
<a href="link5">Title5</a>&nbsp;

...

I am specifically looking to concatenate Title1 and Title2 (separated by a delimiter) into one entry in a list-like object, and likewise for Title3, Title4, and Title5, and so on. (One issue I foresee is that the number of titles varies from one Heading Title to the next.)

I've tried various approaches, including:

import requests, csv
from bs4 import BeautifulSoup

res = requests.get('WEBSITE.html')

soup = BeautifulSoup(res.text, 'html.parser')

soupy4 = soup.select('a')

categories = []

with open('output.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',', lineterminator='\n')
    for line in soupy4:
        if 'common_element_link' in line['href']:
            categories.append(line.next_element)
            writer.writerow([categories])

However, while this writes all the titles to the file, each row repeats every earlier title and then adds the newest one:

['Title1']
['Title1', 'Title2']
['Title1', 'Title2', 'Title3']
['Title1', 'Title2', 'Title3', 'Title4']
...

Ideally, I want the output to look like this:

['Title1', 'Title2']
['Title3', 'Title4', 'Title5']
...

I am very much a newbie with regard to Python lists and programming in general, and I am at a loss for how to proceed. I would appreciate any feedback anyone may have.

Thank you!

loop_de_loop asked Jan 27 '23 11:01



2 Answers

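You could use the :nth-of-type and :not pseudo-classes together with the general sibling combinator (~). In the HTML shown, the a and b tags all appear to be siblings, so I use the b tags with :nth-of-type to split the a tags into blocks: b:nth-of-type(i) ~ a selects every a that follows the i-th b, and the :not(...) clause removes the a tags that also follow the next b, since those belong to later blocks.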

from bs4 import BeautifulSoup as bs

html = '''
<B>Heading Title 1:</B>&nbsp;<a href="link1">Title1</a>&nbsp;
<a href="link2">Title2</a>&nbsp;

&nbsp;

<B>Heading Title 2:</B>&nbsp;<a href="link3">Title3</a>&nbsp;
<a href="link4">Title4</a>&nbsp;
<a href="link5">Title5</a>&nbsp;
'''
soup = bs(html, 'lxml')
items = soup.select('b:has(~a)')
length = len(items)
if length == 1:
    row = [item.text for item in soup.select('b ~ a')]
    print(row)
elif length > 1:
    for i in range(1, length + 1):
        row = [item.text for item in soup.select('b:nth-of-type(' + str(i) + ') ~ a:not(b:nth-of-type(' + str(i + 1) + ') ~ a)')]
        print(row)

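Because the grouping relies on the a and b tags being siblings, the same blocks can also be collected by walking each heading's following siblings directly. The sketch below is an alternative to the selector approach above, not a restatement of it. The function name group_links_by_heading and the sample_html string are made up for illustration, and it assumes the headings and links share one parent, as in the snippet from the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (html.parser lowercases <B> to b)
sample_html = '''
<B>Heading Title 1:</B>&nbsp;<a href="link1">Title1</a>&nbsp;
<a href="link2">Title2</a>&nbsp;

&nbsp;

<B>Heading Title 2:</B>&nbsp;<a href="link3">Title3</a>&nbsp;
<a href="link4">Title4</a>&nbsp;
<a href="link5">Title5</a>&nbsp;
'''


def group_links_by_heading(html):
    """Return one list of link texts per <b> heading (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    groups = []
    for heading in soup.find_all('b'):
        titles = []
        for sibling in heading.next_siblings:
            # Text nodes have no tag name, so they fall through both checks
            name = getattr(sibling, 'name', None)
            if name == 'b':
                break  # the next heading starts a new block
            if name == 'a':
                titles.append(sibling.get_text(strip=True))
        groups.append(titles)
    return groups


print(group_links_by_heading(sample_html))
# [['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]
```

A heading with no links after it yields an empty list here, which may or may not be what you want in the final output.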

QHarr answered May 11 '23 12:05



You can use itertools.groupby to combine all link text between headings:

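import itertools
from bs4 import BeautifulSoup as soup

# content is the HTML string to parse (for example, res.text from the question)
# A list of names matches only b and a tags; re.compile('b|a') would also match tags such as body or table
d = [[i.name, i] for i in soup(content, 'html.parser').find_all(['b', 'a'])]
# Group consecutive tags by whether they are headings (b) or links (a)
new_d = [[a, list(b)] for a, b in itertools.groupby(d, key=lambda x: x[0] == 'b')]
# Keep only the link groups and extract their text
final_result = [[c.text for _, c in b] for a, b in new_d if not a]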

Output:

[['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]

The find_all call acts as a "flattener": it returns the b and a tags in document order, and each one is stored alongside its tag name. itertools.groupby then groups consecutive entries by whether the tag is a b (heading) tag, so all the links between two headings end up in one group. A final pass over new_d skips the b groups and extracts the text from the links in the remaining groups.
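To get from the grouped titles to what the question asks for (one delimiter-joined entry per heading, written to a CSV file), something like the following might work. It is only a sketch: the groups list is copied from the output shown above, the helper names join_groups and write_rows and the '; ' delimiter are assumptions, and an in-memory buffer stands in for output.csv.

```python
import csv
import io

# Grouped titles in the same shape as final_result
groups = [['Title1', 'Title2'], ['Title3', 'Title4', 'Title5']]


def join_groups(groups, delimiter='; '):
    """Collapse each group of titles into a single delimited string."""
    return [delimiter.join(titles) for titles in groups]


def write_rows(rows, handle):
    """Write one single-column CSV row per joined entry."""
    writer = csv.writer(handle, lineterminator='\n')
    for row in rows:
        writer.writerow([row])


joined = join_groups(groups)
print(joined)  # ['Title1; Title2', 'Title3; Title4; Title5']

# io.StringIO stands in for open('output.csv', 'w', newline='')
buffer = io.StringIO()
write_rows(joined, buffer)
print(buffer.getvalue())
```

Choosing a delimiter other than the CSV comma keeps each joined entry in one unquoted cell. If you would rather have one title per column, pass each group straight to writer.writerow instead of joining it.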

Ajax1234 answered May 11 '23 12:05
