
Website to be scraped has varying class names

I am trying to scrape the title and price of a product, but the website's class names vary. Here is an example:

<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>

When I use another computer, it shows this instead:

<a class="K-ab" href="thewebsite.com"><div class="K-l"><div class="K-m">Product Name</div></div><div class="K-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>

It seems they replace the class prefix with a random letter. I am currently using BeautifulSoup4 and the requests library, with the lxml parser, if that is of any help. Is there a way to match the class other than writing a long chain of if/elif branches, one per prefix? The website I am trying to scrape is carousell.com. Thank you for your time.

Dante Takeshii asked Mar 16 '19 17:03


2 Answers

BeautifulSoup allows you to use a regex as the filter. On your site, the class names of the a tags all contain -ab.

You can use

soup.find_all('a',class_=re.compile("-ab"))

In some cases, though, the class names may share no common substring. You could then check whether the methods in the Going back and forth, Going sideways, Going down and Going up sections of the documentation let you uniquely identify the element you need without relying on the class name.
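As a rough sketch of that idea, the snippet below ignores class names entirely and relies only on the nesting seen in the question's sample markup. It assumes the title always sits in the first child div of the link and the price in the first div of the second child; that is inferred from the two samples only and may not hold for the live site. The helper name extract_listing is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
"""

soup = BeautifulSoup(html, "html.parser")


def extract_listing(a_tag):
    """Return (title, price) using only the tag structure, or None if it does not fit."""
    # Direct child divs of the link: the first wraps the title, the second holds the details.
    blocks = a_tag.find_all("div", recursive=False)
    if len(blocks) < 2:
        return None
    title = blocks[0].get_text(strip=True)
    # Assumption: the price is the first direct child div of the details block.
    details = blocks[1].find_all("div", recursive=False)
    if not details:
        return None
    return title, details[0].get_text(strip=True)


# Look at every link that has an href, then keep only those matching the expected shape.
candidates = (extract_listing(a) for a in soup.find_all("a", href=True))
listings = [item for item in candidates if item is not None]
print(listings)  # [('Product Name', 'S$230')]
```

The trade-off is that this breaks if the site reorders its divs, whereas the regex approach breaks if the class suffixes change.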

Coming back to your question:

from bs4 import BeautifulSoup
import re

# The two variants of the markup from the question, with different class prefixes
html = """
<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
<a class="K-ab" href="thewebsite.com"><div class="K-l"><div class="K-m">Product Name</div></div><div class="K-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
"""
soup = BeautifulSoup(html, 'html.parser')
# Match any a tag whose class contains "-ab", whatever the prefix letter is
a_links = soup.find_all('a', class_=re.compile("-ab"))
print(a_links)

Output:

[<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>, <a class="K-ab" href="thewebsite.com"><div class="K-l"><div class="K-m">Product Name</div></div><div class="K-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>]

Both a tags were selected: their class names differ, but each contains -ab.
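One caveat: an unanchored pattern such as "-ab" also matches any class that merely contains that substring. The sketch below compares it with an anchored pattern and with a CSS attribute selector. The G-abc link is a hypothetical decoy added for illustration, and the CSS variant assumes the -ab class is the last (or only) class on the tag.

```python
import re
from bs4 import BeautifulSoup

html = """
<a class="G-ab" href="thewebsite.com">listing one</a>
<a class="K-ab" href="thewebsite.com">listing two</a>
<a class="G-abc" href="thewebsite.com">unrelated link (hypothetical)</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Unanchored: matches "G-ab", "K-ab" and also "G-abc".
loose = soup.find_all("a", class_=re.compile("-ab"))

# Anchored at the end: only classes that end in "-ab".
strict = soup.find_all("a", class_=re.compile(r"-ab$"))

# CSS attribute selector: the class attribute ends with "-ab".
by_css = soup.select('a[class$="-ab"]')

print(len(loose), len(strict), len(by_css))  # 3 2 2
```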

Bitto Bennichan answered Sep 17 '22 06:09


Yes, what @Bitto mentioned is correct: you can use a regular expression, via the re module, to match the elements whose class names vary. Here is the complete code. You can use a pandas DataFrame to print the results.

from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

# Fetch the search results page
html = requests.get("https://carousell.com/search/products/?cc_id=2195&query=I7&sort_by=time_created%2Cdescending")
soup = BeautifulSoup(html.text, "html.parser")

# Select every listing link by the "-ab" part of its class name
atag = soup.find_all('a', class_=re.compile("-ab"))
itemtitle = []
itemprice = []
for a in atag:
    # Pair each title div ("-m") with its details div ("-k")
    for title, price in zip(a.find_all('div', class_=re.compile("-m")), a.find_all('div', class_=re.compile("-k"))):
        itemtitle.append(title.text)
        # The price is the first div inside the details div
        itemprice.append(price.find('div').text)

df = pd.DataFrame({"Title": itemtitle, "Price": itemprice})
print(df)

Output:

     Price                                              Title
0     £200                          Acer Aspire Laptop (Used)
1     £700            MSI GP62 LEOPARD i7 12gb Ram windows 10
2     £120                                  Apple MacBook Pro
3     £155                                      iPhone 7 Plus
4     £155                                   Goophone I7 Plus
5     £579  MacBook Air 13.3inch 2014 i7 1.7GHz 8GB Ram 12...
6     £550                          MacBook Pro 2016 16GB Ram
7     £600                    CUSTOM GAMING/MEDIA PC COMPUTER
8     £900     MS I GE62 2QF-419UK APACHE/PRO TRUE FIRE POWER
9     £390           HP Envy 15 Intel Core i7 4000MQ 12GB Ram
10    £188                                   Goophone I7 Plus
11    £650                 Apple IMac 27" i7 2.8Ghz Quad Core
12    £600             Custom Gaming Pc (Excellent Condition)
13    £499               iMac 21.5inch with wireless keyboard
14  £1,299             MacBook Pro Retina 13 Inches AppleCare
15    £700                              I7 4790k Water Cooled
16    £650                                     Gigabyte P15V2
17    £280                                 Two Monitors i7 PC
18    £250                                  Gaming laptop pro
19  £1,000                              MAC BOOK PRO 15 Ritna
20    £550  Apple MacBook Pro Laptop - A1286 15.2" 500 GB ...
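The loop above assumes every listing has both a title and a price div. A slightly more defensive sketch is shown below: it anchors the patterns at the end of the class name and skips listings where either part is missing. It runs on inline sample markup rather than the live site, returns plain dicts instead of a DataFrame, and the function name parse_listings and the second sample listing are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Anchored patterns, assuming the suffixes seen in the question stay stable.
LINK_CLASS = re.compile(r"-ab$")
TITLE_CLASS = re.compile(r"-m$")
DETAILS_CLASS = re.compile(r"-k$")


def parse_listings(page_html):
    """Return a list of {"Title", "Price"} dicts, skipping incomplete listings."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = []
    for a in soup.find_all("a", class_=LINK_CLASS):
        title = a.find("div", class_=TITLE_CLASS)
        details = a.find("div", class_=DETAILS_CLASS)
        # The price is taken to be the first div inside the details div.
        price = details.find("div") if details is not None else None
        if title is None or price is None:
            continue  # incomplete listing: skip it rather than misalign columns
        rows.append({"Title": title.get_text(strip=True),
                     "Price": price.get_text(strip=True)})
    return rows


sample = """
<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">Product Name</div></div><div class="G-k"><div>S$230</div><div>Product Description</div><div>Used</div></div></a>
<a class="G-ab" href="thewebsite.com"><div class="G-l"><div class="G-m">No Price Item</div></div></a>
"""
rows = parse_listings(sample)
print(rows)  # [{'Title': 'Product Name', 'Price': 'S$230'}]
```

The resulting list can be passed straight to pd.DataFrame(rows) if a table is wanted.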
KunduK answered Sep 19 '22 06:09