How to scrape content from a div class based on data-automation attribute in Python using BeautifulSoup?

I am trying to scrape a dynamic page using BeautifulSoup. I reach the page from https://www.nemlig.com/ with the help of Selenium (thanks to the code advice from @cruisepandey) like this:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC 
from bs4 import BeautifulSoup


driver = webdriver.Chrome(executable_path = r'C:\Users\user\lib\chromedriver_77.0.3865.40.exe')
wait = WebDriverWait(driver,10)

driver.maximize_window()
driver.get("https://www.nemlig.com/")

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".timeslot-prompt.initial-animation-done")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='tel'][class^='pro']"))).send_keys('2300')  
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.prompt__button"))).click()

I am then presented with the page that I want to scrape.


More precisely, at this point I want to scrape the rows on the right-hand side of the page. If you look through the HTML behind them, you will notice that the div with class time-block__row appears three times, each with a different data-automation attribute, one for each of the three main times of the day.

<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
                            <div class="time-block__row-header">Formiddag</div>

                            <div class="no-timeslots ng-hide" ng-show="$ctrl.timeslotDays[$ctrl.selectedDateIndex].morningHours == 0">
                                Ingen levering..
                            </div>

                            <!----><!----><div class="time-block__item duration-1 disabled" ng-repeat="item in $ctrl.selectedHours track by $index" ng-if="item.StartHour >= 0 &amp;&amp; item.StartHour < 12" ng-click="$ctrl.setActiveTimeslot(item, $index)" ng-class="['duration-1', {'cheapest': item.IsCheapHour, 'event': item.IsEventSlot, 'selected': $ctrl.selectedTimeId == item.Id || $ctrl.selectedTimeIndex == $index, 'disabled': item.isUnavailable()}]" data-automation="notActiveSltTmSlt">

                                <div class="time-block__inner-container">
                <div class="time-block__time">8-9</div>
                <div class="time-block__attributes">
                  <!----></div>
                                    <div class="time-block__cost">29&nbsp;kr.</div>

So Formiddag (Morning) has data-automation = "beforDinnerRowTmSlt", Eftermiddag (Afternoon) has data-automation = "afternoonRowTmSlt" and Aften (Evening) has data-automation = "eveningRowTmSlt".

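# page_source is a plain string, not a wait condition, so read it directly
page_source = driver.page_source
# name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(page_source, 'html.parser')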
   
time_of_the_day = soup.find('div', class_='time-block__row').text
The problem is that, using the code above, time_of_the_day only contains information from the Morning rows.
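The behaviour described here is consistent with how find works: it returns only the first matching element, whereas find_all and select return every match. A minimal sketch on a trimmed-down copy of the markup (the sample HTML below is an assumption that keeps only the row headers, not the full page):

```python
from bs4 import BeautifulSoup

# Trimmed-down sample of the markup from the question (row headers only).
html = """
<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
  <div class="time-block__row-header">Formiddag</div>
</div>
<div class="time-block__row" data-automation="afternoonRowTmSlt">
  <div class="time-block__row-header">Eftermiddag</div>
</div>
<div class="time-block__row" data-automation="eveningRowTmSlt">
  <div class="time-block__row-header">Aften</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, so only the morning row comes back.
first_row = soup.find("div", class_="time-block__row")
first_header = first_row.find("div", class_="time-block__row-header").text

# find_all() returns every row with that class.
all_rows = soup.find_all("div", class_="time-block__row")
row_names = [row["data-automation"] for row in all_rows]

# A CSS attribute selector targets one specific row by its data-automation value.
evening = soup.select_one('div[data-automation="eveningRowTmSlt"]')
evening_header = evening.select_one(".time-block__row-header").text

# attrs= is the find()/find_all() counterpart of the attribute selector.
afternoon = soup.find("div", attrs={"data-automation": "afternoonRowTmSlt"})
afternoon_header = afternoon.find("div", class_="time-block__row-header").text

print(first_header, row_names, afternoon_header, evening_header)
```

Either form should work on the real page source once Selenium has rendered the time slots.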

How can I scrape these rows properly using the data-automation attribute? How can I access the other two time-block__row divs and their child divs? My plan is to create a dataframe containing something like this:

Time_of_the_day          Hours          Price        Day
Formiddag                8-9            29kr.        Tor. 10/10
....                     ....           ....         ....
Eftermiddag              12-13          29kr.        Tor. 10/10
....                     ....           ....         ....

The day column will contain the output from here: day = soup.find('div', class_='content').text

I know this is quite a lengthy post but hopefully I've made it easy to understand the task and you will be able to help me out with advice, tips or code!

Questieme asked Mar 20 '26 07:03


2 Answers

Here is the code to get all those values.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import pandas as pd

driver = webdriver.Chrome(executable_path = r'C:\Users\user\lib\chromedriver_77.0.3865.40.exe')
wait = WebDriverWait(driver,10)
driver.maximize_window()
driver.get("https://www.nemlig.com/")

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, ".timeslot-prompt.initial-animation-done")))
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "input[type='tel'][class^='pro']"))).send_keys('2300')
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn.prompt__button"))).click()
time.sleep(3)
soup=BeautifulSoup(driver.page_source,'html.parser')
time_of_day=[]
price=[]
Hours=[]
day=[]
for morn in soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__time'):
    time_of_day.append(soup.select_one('[data-automation="beforDinnerRowTmSlt"] > .time-block__row-header').text)
    Hours.append(morn.text)
    price.append(morn.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
print(df)

time_of_day=[]
price=[]
Hours=[]
day=[]

for after in soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__time'):
    time_of_day.append(soup.select_one('[data-automation="afternoonRowTmSlt"] > .time-block__row-header').text)
    Hours.append(after.text)
    price.append(after.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
print(df)

time_of_day=[]
price=[]
Hours=[]
day=[]

for evenin in soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__time'):
    time_of_day.append(soup.select_one('[data-automation="eveningRowTmSlt"] > .time-block__row-header').text)
    Hours.append(evenin.text)
    price.append(evenin.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day})
print(df)

Output:

         Day  Hours   price time_of_day
0  fre. 11/10    8-9  29 kr.   Formiddag
1  fre. 11/10   9-10  29 kr.   Formiddag
2  fre. 11/10  10-11  39 kr.   Formiddag
3  fre. 11/10  11-12  39 kr.   Formiddag
          Day  Hours   price  time_of_day
0  fre. 11/10  12-13  29 kr.  Eftermiddag
1  fre. 11/10  13-14  29 kr.  Eftermiddag
2  fre. 11/10  14-15  29 kr.  Eftermiddag
3  fre. 11/10  15-16  29 kr.  Eftermiddag
4  fre. 11/10  16-17  29 kr.  Eftermiddag
5  fre. 11/10  17-18  19 kr.  Eftermiddag
          Day  Hours   price time_of_day
0  fre. 11/10  18-19  29 kr.       Aften
1  fre. 11/10  19-20  19 kr.       Aften
2  fre. 11/10  20-21  29 kr.       Aften
3  fre. 11/10  21-22  19 kr.       Aften

Edited to collect all three rows in a single DataFrame and to flag disabled slots:

soup=BeautifulSoup(driver.page_source,'html.parser')
time_of_day=[]
price=[]
Hours=[]
day=[]
disabled=[]

for morn,d in zip(soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="beforDinnerRowTmSlt"]').select('.time-block__item')):

    time_of_day.append(soup.select_one('[data-automation="beforDinnerRowTmSlt"] > .time-block__row-header').text)
    Hours.append(morn.text)
    price.append(morn.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    if 'disabled' in d['class']:
        disabled.append('1')
    else:
        disabled.append('0')

for after,d in zip(soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="afternoonRowTmSlt"]').select('.time-block__item')):
    time_of_day.append(soup.select_one('[data-automation="afternoonRowTmSlt"] > .time-block__row-header').text)
    Hours.append(after.text)
    price.append(after.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    if 'disabled' in d['class']:
        disabled.append('1')
    else:
        disabled.append('0')

for evenin,d in zip(soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__time'),soup.select_one('[data-automation="eveningRowTmSlt"]').select('.time-block__item')):
    time_of_day.append(soup.select_one('[data-automation="eveningRowTmSlt"] > .time-block__row-header').text)
    Hours.append(evenin.text)
    price.append(evenin.find_next(class_="time-block__cost").text)
    day.append(soup.select_one('.date-block.selected [data-automation="dayNmTmSlt"]').text + " " + soup.select_one('.date-block.selected [data-automation="dayDateTmSlt"]').text)
    if 'disabled' in d['class']:
        disabled.append('1')
    else:
        disabled.append('0')

df = pd.DataFrame({"time_of_day":time_of_day,"Hours":Hours,"price":price,"Day":day,"Disabled" : disabled})
print(df)

Output:

           Day Disabled  Hours   price  time_of_day
0   fre. 11/10        1    8-9  29 kr.    Formiddag
1   fre. 11/10        1   9-10  29 kr.    Formiddag
2   fre. 11/10        0  10-11  39 kr.    Formiddag
3   fre. 11/10        0  11-12  39 kr.    Formiddag
4   fre. 11/10        0  12-13  29 kr.  Eftermiddag
5   fre. 11/10        0  13-14  29 kr.  Eftermiddag
6   fre. 11/10        0  14-15  19 kr.  Eftermiddag
7   fre. 11/10        0  15-16  29 kr.  Eftermiddag
8   fre. 11/10        0  16-17  29 kr.  Eftermiddag
9   fre. 11/10        0  17-18  29 kr.  Eftermiddag
10  fre. 11/10        0  18-19  29 kr.        Aften
11  fre. 11/10        0  19-20  19 kr.        Aften
12  fre. 11/10        0  20-21  29 kr.        Aften
13  fre. 11/10        0  21-22  19 kr.        Aften
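
The three loops above differ only in the data-automation value, so they could be folded into one loop. The following is a sketch rather than a drop-in replacement: the sample HTML is a hypothetical, shortened stand-in for driver.page_source, and extract_rows and ROW_ATTRIBUTES are names introduced here for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample of the page markup; the real page has more slots.
html = """
<div class="date-block selected">
  <div data-automation="dayNmTmSlt">fre.</div>
  <div data-automation="dayDateTmSlt">11/10</div>
</div>
<div class="time-block__row" data-automation="beforDinnerRowTmSlt">
  <div class="time-block__row-header">Formiddag</div>
  <div class="time-block__item duration-1 disabled">
    <div class="time-block__time">8-9</div>
    <div class="time-block__cost">29&nbsp;kr.</div>
  </div>
</div>
<div class="time-block__row" data-automation="afternoonRowTmSlt">
  <div class="time-block__row-header">Eftermiddag</div>
  <div class="time-block__item duration-1">
    <div class="time-block__time">12-13</div>
    <div class="time-block__cost">29&nbsp;kr.</div>
  </div>
</div>
<div class="time-block__row" data-automation="eveningRowTmSlt">
  <div class="time-block__row-header">Aften</div>
  <div class="time-block__item duration-1">
    <div class="time-block__time">18-19</div>
    <div class="time-block__cost">29&nbsp;kr.</div>
  </div>
</div>
"""

# The three data-automation values named in the question.
ROW_ATTRIBUTES = ["beforDinnerRowTmSlt", "afternoonRowTmSlt", "eveningRowTmSlt"]


def extract_rows(soup):
    """Return one dict per time slot across all three rows."""
    selected = '.date-block.selected [data-automation="{}"]'
    day = (soup.select_one(selected.format("dayNmTmSlt")).text
           + " " + soup.select_one(selected.format("dayDateTmSlt")).text)
    rows = []
    for attribute in ROW_ATTRIBUTES:
        block = soup.select_one('[data-automation="{}"]'.format(attribute))
        if block is None:  # row not present on the page, skip it
            continue
        header = block.select_one(".time-block__row-header").get_text(strip=True)
        for item in block.select(".time-block__item"):
            rows.append({
                "time_of_day": header,
                "Hours": item.select_one(".time-block__time").get_text(strip=True),
                "price": item.select_one(".time-block__cost").get_text(strip=True),
                "Day": day,
                "Disabled": "1" if "disabled" in item.get("class", []) else "0",
            })
    return rows


rows = extract_rows(BeautifulSoup(html, "html.parser"))
for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows) should give the same kind of single frame as the edited code, and reading time and cost from the same .time-block__item avoids relying on zip keeping two separate lists aligned.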
KunduK answered Mar 21 '26 20:03


You can use soup.find_all:

from bs4 import BeautifulSoup as soup
import re
... #rest of your current selenium code

d = soup(driver.page_source, 'html.parser')
r, _day = [[i.div.text, [['disabled' in k['class'], k.find_all('div', {'class':re.compile('time-block__time|time-block__cost')})] for k in i.find_all('div', {'class':'time-block__item'})]] for i in d.find_all('div', {'class':'time-block__row'})], d.find('div', {'class':'content'}).get_text(strip=True)
new_r = [[a, [[int(j), *[i.text for i in b]] for j, b in k]] for a, k in r]
new_data = [[a, *i, _day] for a, b in new_r for i in b]

To convert your results to a dataframe:

import pandas as pd
df = pd.DataFrame([dict(zip(['Time_of_the_day', 'Disabled', 'Hours', 'Price', 'Day'], i)) for i in new_data])

Output:

      Day  Disabled  Hours   Price Time_of_the_day
0   fre.11/10         1    8-9  29 kr.       Formiddag
1   fre.11/10         1   9-10  29 kr.       Formiddag
2   fre.11/10         1  10-11  39 kr.       Formiddag
3   fre.11/10         0  11-12  39 kr.       Formiddag
4   fre.11/10         0  12-13  29 kr.     Eftermiddag
....
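
The Hours and Price columns in the outputs above hold strings, and in the page source the price is written with a non-breaking space (29&nbsp;kr.). If numeric values are wanted, a small post-processing step along these lines may help. parse_price and parse_hours are hypothetical helper names, and the sketch assumes every price contains a whole number and every hour range has the form start-end.

```python
import re


def parse_price(text):
    # Pull the first run of digits out of a string such as "29\xa0kr.";
    # returns None when the text contains no digits at all.
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def parse_hours(text):
    # Split an hour range such as "8-9" into integer start and end hours.
    start, end = text.split("-")
    return int(start), int(end)


price = parse_price("29\xa0kr.")   # non-breaking space, as in the page source
hours = parse_hours("10-11")
print(price, hours)
```

With pandas, these could be applied column-wise, for example df['Price'].map(parse_price).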
Ajax1234 answered Mar 21 '26 21:03


