Scraping text in h3 and div tags using BeautifulSoup, Python

Tags: python, html, beautifulsoup, selenium, web-crawler

I have no experience with Python, BeautifulSoup, Selenium, etc., but I want to scrape data from a website and store it as a CSV file. A single record of the data I need (one row of output) is marked up as follows.

<div class="box effect">
<div class="row">
<div class="col-lg-10">
    <h3>HEADING</h3>
        <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
        <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
        <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
        <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    <div class="space">&nbsp;</div>

<div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> &nbsp;more info</a></div>

</div>
<div class="col-lg-2">

</div>
</div>
</div>

The output I need is Heading,NAME,MOBILE,NUMBER,XYZ_ADDRESS

These values have no id or class of their own; they appear in the page as plain text. I tried BeautifulSoup and Selenium (Python) separately, but got stuck with both, because none of the tutorials I found showed how to extract text from <h3> and <div> tags like these.

My code using BeautifulSoup

import urllib2
from bs4 import BeautifulSoup
import requests
import csv

MAX = 2

'''with open("lg.csv", "a") as f:
  w=csv.writer(f)'''
##for i in range(1,MAX+1)
url="http://www.example_site.com"

page=requests.get(url)
soup = BeautifulSoup(page.content,"html.parser")

for h in soup.find_all('h3'):
    print(h.get('h3'))

My selenium code

import csv
from selenium import webdriver
MAX_PAGE_NUM = 2
driver = webdriver.Firefox()
for i in range(1, MAX_PAGE_NUM+1):
  url = "http://www.example_site.com"
  driver.get(url)
  name = driver.find_elements_by_xpath('//div[@class = "col-lg-10"]/h3')
  #contact = driver.find_elements_by_xpath('//span[@class="item-price"]')
#  phone = 
#  mobile = 
#  address =
#  print(len(buyers))
#  num_page_items = len(buyers)
#  with open('res.csv','a') as f:
#    for i in range(num_page_items):
#      f.write(buyers[i].text + "," + prices[i].text + "\n")
  print (name)          
driver.close()


asked Oct 25 '17 13:10

Revaapriyan

2 Answers

You can use CSS selectors to find the data you need. In your case, div > h3 ~ div will find all div elements that are directly inside a div element and are preceded by an h3 sibling element.

import bs4

page= """
<div class="box effect">
<div class="row">
<div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
    <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
    <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
</div>
</div>
</div>
"""

soup = bs4.BeautifulSoup(page, 'lxml')

# find all div elements that are direct children of a div element
# and are preceded by an h3 sibling
selector = 'div > h3 ~ div'

# find elements that contain the data we want
found = soup.select(selector)

# Extract the text of each element. BeautifulSoup has already decoded
# &nbsp; to a non-breaking space, which strip() removes.
data = [x.text.strip() for x in found]

for x in data:
    print(x)
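
One caveat, offered as a hedged sketch rather than part of the original answer: on the full markup from the question, the same selector also matches the `<div class="space">` spacer and the "more info" button wrapper, because both are later siblings of the h3. Assuming the data rows are the only divs that have an `<i>` icon as a direct child (true for the sample, not verified for the real site), they can be filtered like this:

```python
from bs4 import BeautifulSoup

# Markup from the question, including the spacer and the button wrapper.
html = """
<div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
    <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
    <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    <div class="space">&nbsp;</div>
    <div style="padding:10px;padding-left:0px;"><a class="btn btn-primary btn-sm" href="www.link_to_another_page.com"><i class="fa search-plus"></i> &nbsp;more info</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every div that follows the h3 as a sibling, wanted or not.
all_siblings = soup.select("div > h3 ~ div")

# Assumption: only the data rows have an <i> icon as a *direct* child.
# The spacer has no <i>; the button's <i> sits inside an <a>.
fields = [
    d.get_text(strip=True)
    for d in all_siblings
    if d.find("i", recursive=False) is not None
]

print(len(all_siblings))
print(fields)
```

If the real pages mark up their rows differently, the filter condition would need to change accordingly.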

Edit: To scrape the text of the heading:

heading = soup.find('h3') 
heading_data = heading.text
print(heading_data)
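
This is also why the loop in the question prints None: a tag's `.get()` looks up an HTML attribute, not its text. A small sketch of the difference follows; the `id="title"` attribute is made up for illustration and does not appear in the question's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document; the id attribute is illustrative only.
soup = BeautifulSoup('<h3 id="title">HEADING</h3>', "html.parser")
h = soup.find("h3")

attr_missing = h.get("h3")      # there is no attribute named "h3"
attr_id = h.get("id")           # attribute lookup
text = h.get_text(strip=True)   # text content of the tag

print(attr_missing, attr_id, text)
```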

Edit: Alternatively, you can get the heading and the other div elements at once with a selector such as div.col-lg-10 > *. This matches every direct child of a div element with the class col-lg-10.

soup = bs4.BeautifulSoup(page, 'lxml')

# find all direct children of a div element of class col-lg-10
selector = 'div.col-lg-10 > *'

# find elements that contain the data we want
found = soup.select(selector)

# Extract the text of each element. BeautifulSoup has already decoded
# &nbsp; to a non-breaking space, which strip() removes.
data = [x.text.strip() for x in found]

for x in data:
    print(x)
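
To get from here to the CSV row the question asks for (Heading,NAME,MOBILE,NUMBER,XYZ_ADDRESS), one possible approach is to build one list per col-lg-10 box and hand the lists to the standard csv module. The following is a minimal sketch, not a tested scraper for the real site: `rows_from_html` is a hypothetical helper name, and it assumes every record lives in its own `div.col-lg-10` with data rows marked by a direct `<i>` child.

```python
import csv
import io

from bs4 import BeautifulSoup

html = """
<div class="col-lg-10">
    <h3>HEADING</h3>
    <div><i class="fa user"></i>&nbsp;&nbsp;NAME</div>
    <div><i class="fa phone"></i>&nbsp;&nbsp;MOBILE</div>
    <div><i class="fa mobile-phone fa-2"></i>&nbsp;&nbsp;&nbsp;NUMBER</div>
    <div><i class="fa address"></i>&nbsp;&nbsp;&nbsp;XYZ_ADDRESS</div>
    <div class="space">&nbsp;</div>
</div>
"""


def rows_from_html(html):
    """Return one [heading, field, field, ...] list per col-lg-10 box."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for box in soup.select("div.col-lg-10"):
        heading = box.find("h3")
        if heading is None:
            continue  # skip boxes without a heading
        # Assumption: data rows are direct child divs with an <i> icon.
        fields = [
            d.get_text(strip=True)
            for d in box.find_all("div", recursive=False)
            if d.find("i", recursive=False) is not None
        ]
        rows.append([heading.get_text(strip=True)] + fields)
    return rows


rows = rows_from_html(html)

# Written to an in-memory buffer here so the sketch leaves no files behind.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write a real file instead, the buffer can be swapped for `open("lg.csv", "w", newline="")`, using the file name from the question.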


answered Oct 03 '22 13:10

Anonta

This worked well for me; it prints the text of every h3 on the page:

    #  -*- coding: utf-8 -*-
    # by Faguiro #
    # run using Python 3.8.6  on Linux#
    import requests
    from bs4 import BeautifulSoup

    # insert your site here
    url= input("Enter the url-->")

    #use requests
    r = requests.get(url)
    content = r.content

    #soup!
    soup = BeautifulSoup(content, "html.parser")

    #find all h3 tags in the soup.
    heading = soup.find_all("h3")

    #print(heading) <--- result...

    #...pythonic organization!
    for h in heading:
        print(h.text.strip())

Dependencies (install from a Linux terminal):

sudo apt-get install python3-bs4
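
A slightly more defensive variant of the script above, as a sketch only: it separates downloading from parsing so the parsing step can be checked against a literal HTML string without network access. The function names `fetch_html` and `extract_headings` and the 10-second timeout are illustrative choices, not something from the original answer.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and raise an exception on HTTP 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_headings(html):
    """Return the stripped text of every <h3> in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [h.get_text(strip=True) for h in soup.find_all("h3")]


# Offline check on a literal string; fetch_html is not called here.
sample = "<div><h3> First </h3><p>x</p><h3>Second</h3></div>"
headings = extract_headings(sample)
print(headings)
```

Against a live page the two steps would be chained, for example `extract_headings(fetch_html("http://www.example_site.com"))` with the placeholder URL from the question.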

answered Oct 03 '22 11:10

Fabiano Rocha
