
Beautifulsoup parsing data under specific tag

Right now I am parsing a web page with this code:

boards = soup(itemprop="name")
prices = soup("span", { "class" : "price-currency" })

for board, price in zip(boards, prices):
    print(board.text.strip(), price.next_sibling)

And it prints the board and the price like this:

SURFBOARD RACK free delivery to your door 120.00
Huge Beginner Surfboard Sale! Kids & Adult Softboards all 1/2 Price!! 90.00
Mega Softboard Clearance Sale! Beginner Foam SurfBoards 1/2 Price! 90.00
Surfboard 6'2" Simon Anderson Spudnick 360.00
Surfboard Cover, Surfboard Bags, Cheap Single Surf Board Bags 50.00

The web page that I am parsing is split into 3 sections: sponsored links, top ads, and recent ads. I am printing data from all 3 of these sections, but I want data only from the recent ads section, which is wrapped in this HTML element:

<div class="module__body ad-listing">

How do I specify that I only want the boards and prices printed from beneath this section?

Page: https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true

Frank Harb asked Dec 05 '25 10:12



1 Answer

You may detest this answer. My inclination is to use the lxml module when I see complicated HTML like that because I can use xpath expressions.

In this case the first xpath expression selects the li elements inside the div that you want. Inside the loop, two more xpath expressions run relative to each li element: one finds the ad title (for example "Quicksale 6'4 Dylan Surfboard RX5") and the other finds the span whose text nodes make up the price information. Item 12 seems to be coded differently; I haven't investigated that.

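>>> import requests
>>> from lxml import etree
>>> page = requests.get('https://www.gumtree.com.au/s-surfing/mona-vale-sydney/surfboard/k0c18568l3003999r10?fromSearchBox=true').text
>>> parser = etree.HTMLParser()
>>> tree = etree.fromstring(page, parser=parser)
>>> # Select only the li elements inside the "recent ads" container.
>>> recents = tree.xpath('.//div[@class="module__body ad-listing"]/ul/li')
>>> for i, recent in enumerate(recents):
...     try:
...         # Both lookups are relative to the current li; [0] raises IndexError if nothing matches.
...         name = recent.xpath('.//span[@itemprop="name"]/text()')[0].strip()
...         # Use `recent` here. The original session used `first_recent`, a leftover
...         # variable, which is probably why every price in the output below is identical.
...         one_span = recent.xpath('.//span[@class="j-original-price"]')[0]
...     except IndexError:
...         '-------------> item', i, 'failed'
...         continue
...     i, name
...     ' '.join(t.strip() for t in one_span.itertext() if t.strip())
...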
(0, "Quicksale 6'4 Dylan Surfboard RX5")
'$ 450.00 Negotiable'
(1, 'DHD 5\'9 "Switchblade" Surfboard')
'$ 450.00 Negotiable'
(2, '6ft Modern Surfboards - Highline')
'$ 450.00 Negotiable'
(3, "5'11 Channel Island T-Low surfboard")
'$ 450.00 Negotiable'
(4, 'Chill Rare Bird Surfboard 5"8')
'$ 450.00 Negotiable'
(5, 'Vintage surfboard')
'$ 450.00 Negotiable'
(6, "5'7 Annesley Blonde model")
'$ 450.00 Negotiable'
(7, 'McCoy single fin surfboard')
'$ 450.00 Negotiable'
(8, 'Sculpt surfboard')
'$ 450.00 Negotiable'
(9, '8\'1" longboard surfboard travel cover')
'$ 450.00 Negotiable'
(10, 'Longboard Surfboard')
'$ 450.00 Negotiable'
(11, "5'10 Custom Chaos Surfboard")
'$ 450.00 Negotiable'
('-------------> item', 12, 'failed')
(13, "6'0 JS lowdown")
'$ 450.00 Negotiable'
(14, 'Mega Softboard Clearance Sale! Beginner Foam SurfBoards 1/2 Price!')
'$ 450.00 Negotiable'
(15, 'Surfboard')
'$ 450.00 Negotiable'
(16, 'Surfboard 5\'10" 30 lt')
'$ 450.00 Negotiable'
(17, 'Christenson Super Sport Surfboard')
'$ 450.00 Negotiable'
(18, 'TOMO Firewire V4 Surfboard')
'$ 450.00 Negotiable'
(19, "Surfboard 6'6 baked bean")
'$ 450.00 Negotiable'
(20, 'foam surfboards')
'$ 450.00 Negotiable'
(21, 'Channel Islands surfboard')
'$ 450.00 Negotiable'
(22, 'Channel Islands Surfboard')
'$ 450.00 Negotiable'
(23, 'JS surfboard')
'$ 450.00 Negotiable'
(24, 'CLASSIC RETRO SURF FACTORY MINI MAL')
'$ 450.00 Negotiable'
(25, 'Surfboard JS')
'$ 450.00 Negotiable'
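
If you would rather stay with BeautifulSoup, the same idea (find the container first, then search inside each li) might look roughly like the sketch below. It runs on a small inline sample rather than the live page: SAMPLE_HTML and the helper name recent_ads are made up for illustration, and the markup assumes that the price text directly follows the price-currency span, as the price.next_sibling call in the question suggests. The real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real page; the actual markup may differ.
SAMPLE_HTML = """
<div class="module__body top-ads">
  <ul><li>
    <span itemprop="name">Sponsored board</span>
    <span class="j-original-price"><span class="price-currency">$</span>90.00</span>
  </li></ul>
</div>
<div class="module__body ad-listing">
  <ul>
    <li>
      <span itemprop="name"> Vintage surfboard </span>
      <span class="j-original-price"><span class="price-currency">$</span>120.00</span>
    </li>
    <li>
      <span itemprop="name">Longboard Surfboard</span>
    </li>
  </ul>
</div>
"""


def recent_ads(html):
    """Return (name, price) pairs found inside the ad-listing container only."""
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to the one container before looking for names and prices.
    section = soup.select_one("div.module__body.ad-listing")
    if section is None:
        return []
    results = []
    for item in section.select("ul > li"):
        name = item.find(itemprop="name")
        if name is None:
            continue  # skip list items that are coded differently
        currency = item.find("span", class_="price-currency")
        price = None
        # Assumption: the amount is the text node right after the currency span.
        if currency is not None and isinstance(currency.next_sibling, str):
            price = currency.next_sibling.strip()
        results.append((name.get_text(strip=True), price))
    return results


for name, price in recent_ads(SAMPLE_HTML):
    print(name, price)
```

Looking up the name and the price inside the same li keeps each pair together even when an ad has no price, which zip() over two separate page-wide lists cannot guarantee.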
Bill Bell answered Dec 07 '25 01:12



