I am trying to scrape the title and price of a product, but I am facing a problem: the website's class names vary. For example, it shows this,
<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
When I use another computer, it shows this instead,
<a class="K-ab" href="thewebsite.com"><div class="K-l"><div class="K-m">Product Name</div></div><div class="K-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
I realized that they change the class prefix to a random letter. I am currently using the BeautifulSoup4 and requests libraries. Is there any way to get the class, other than writing a whole long chain of "if-elif" checks? The website I am trying to scrape is carousell.com. I am currently using the lxml parser, if that is of any help. Thank you for your time.
BeautifulSoup allows you to use a regex as the filter. On your site, the class names of the a tags all contain -ab.
You can use
soup.find_all('a',class_=re.compile("-ab"))
But in some cases there may not be any common substring in the class names; then you could try the methods in the Going down, Going up, Going sideways, and Going back and forth sections of the documentation to uniquely identify the element you need without relying on the class name.
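As a minimal sketch of that tree-navigation idea, you can anchor on something that does not change (here I assume the href and the "S$" price prefix are stable) and walk to the title and price without ever naming a class:

```python
from bs4 import BeautifulSoup

html = '<a class="K-ab" href="thewebsite.com"><div class="K-l"><div class="K-m">Product Name</div></div><div class="K-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>'
soup = BeautifulSoup(html, "html.parser")

# Anchor on the stable href attribute instead of the random class.
a = soup.find("a", href="thewebsite.com")

# The title is the first nested div's text, regardless of its class name.
title = a.find("div").find("div").text

# The price is the first text node starting with "S$".
price = a.find(string=lambda s: s.startswith("S$"))

print(title, price)  # Product Name S$230
```

This only works as long as the structure (not just the class names) stays the same, so the regex approach above is usually the more robust first choice.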
Coming back to your question:
html="""
<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
<a class="K-ab" href="thewebsite.com"><div class="K-l"><div class="K-m">Product Name</div></div><div class="K-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
"""
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html, 'html.parser')
a_links = soup.find_all('a', class_=re.compile("-ab"))
print(a_links)
Outputs:
[<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>, <a class="K-ab" href="thewebsite.com"><div class="K-l"><div class="K-m">Product Name</div></div><div class="K-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>]
Both the a tags, despite their different class names, contain -ab and were selected.
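An equivalent selection is possible with a CSS "ends with" attribute selector via select, which avoids importing re (this assumes the -ab suffix is stable and that the class attribute holds a single value):

```python
from bs4 import BeautifulSoup

html = '''
<a class="G-ab" href="thewebsite.com"><div>x</div></a>
<a class="K-ab" href="thewebsite.com"><div>y</div></a>
'''
soup = BeautifulSoup(html, "html.parser")

# [class$="-ab"] matches any a tag whose class attribute ends with "-ab".
links = soup.select('a[class$="-ab"]')
print(len(links))  # 2
```

Note that attribute selectors match the full attribute string, so this would miss an element like class="G-ab featured"; the re.compile filter does not have that limitation.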
Yes, what @Bitto mentioned is correct: you can use a regular expression, via the re module, to identify these elements. Here is your code adapted accordingly; you can use a pandas DataFrame to print the results.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

html = requests.get("https://carousell.com/search/products/?cc_id=2195&query=I7&sort_by=time_created%2Cdescending")
soup = BeautifulSoup(html.text, "html.parser")
atag = soup.find_all('a', class_=re.compile("-ab"))
itemtitle = []
itemprice = []
for a in atag:
    for title, price in zip(a.find_all('div', class_=re.compile("-m")), a.find_all('div', class_=re.compile("-k"))):
        itemtitle.append(title.text)
        itemprice.append(price.find('div').text)
df = pd.DataFrame({"Title": itemtitle, "Price": itemprice})
print(df)
Output:
Price Title
0 £200 Acer Aspire Laptop (Used)
1 £700 MSI GP62 LEOPARD i7 12gb Ram windows 10
2 £120 Apple MacBook Pro
3 £155 iPhone 7 Plus
4 £155 Goophone I7 Plus
5 £579 MacBook Air 13.3inch 2014 i7 1.7GHz 8GB Ram 12...
6 £550 MacBook Pro 2016 16GB Ram
7 £600 CUSTOM GAMING/MEDIA PC COMPUTER
8 £900 MS I GE62 2QF-419UK APACHE/PRO TRUE FIRE POWER
9 £390 HP Envy 15 Intel Core i7 4000MQ 12GB Ram
10 £188 Goophone I7 Plus
11 £650 Apple IMac 27" i7 2.8Ghz Quad Core
12 £600 Custom Gaming Pc (Excellent Condition)
13 £499 iMac 21.5inch with wireless keyboard
14 £1,299 MacBook Pro Retina 13 Inches AppleCare
15 £700 I7 4790k Water Cooled
16 £650 Gigabyte P15V2
17 £280 Two Monitors i7 PC
18 £250 Gaming laptop pro
19 £1,000 MAC BOOK PRO 15 Ritna
20 £550 Apple MacBook Pro Laptop - A1286 15.2" 500 GB ...
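A slightly more defensive variant of the loop above (a sketch, using the same assumed -ab/-m/-k suffixes) anchors each regex with $ so that, say, -m cannot accidentally match a longer class name that merely contains it, and skips any anchor missing a title or price:

```python
import re
from bs4 import BeautifulSoup

# Inlined sample markup standing in for the live page.
html = '''
<a class="G-ab" href="#"><div class="G-l"><div class="G-m">Laptop</div></div>
<div class="G-k"><div>S$230</div><div>desc</div></div></a>
'''
soup = BeautifulSoup(html, "html.parser")

items = []
for a in soup.find_all("a", class_=re.compile(r"-ab$")):
    title = a.find("div", class_=re.compile(r"-m$"))
    price_box = a.find("div", class_=re.compile(r"-k$"))
    # Skip entries where either field is missing instead of raising.
    if title and price_box and price_box.find("div"):
        items.append({"Title": title.text, "Price": price_box.find("div").text})

print(items)  # [{'Title': 'Laptop', 'Price': 'S$230'}]
```

Building a list of dicts first also means a single malformed listing does not desynchronize the Title and Price columns, which can happen when appending to two parallel lists.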