How to scrape a whole website using beautifulsoup

I'm quite new to programming, and to object-oriented programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. Here's my first approach:

I need to fetch the data out of this page: http://europa.eu/youth/volunteering/evs-organisation_en

First, I view the page source to find the relevant HTML elements: view-source:https://europa.eu/youth/volunteering/evs-organisation_en

Note: I need to fetch the data that comes right below this line:

EVS accredited organisations search results: 6066

I chose Beautiful Soup for this job, since it is very powerful.

I use find_all:

soup.find_all('p')[0].get_text()  # text of the first <p> tag on the page

Searching for tags by class and id:

Note: Classes and IDs are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to select the specific elements we want.
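As a minimal sketch of selecting by class and by id, the snippet below runs find_all and find against a small inline document instead of the live page. The class name ey_info comes from the page excerpt; the id "results" and the wrapper div are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: only the ey_info class is taken from the real page.
SAMPLE_HTML = """
<div id="results">
  <p class="ey_info">Zaporizhzhya, <strong>Ukraine</strong></p>
  <p class="ey_info">Sending</p>
  <p><strong>PIC no:</strong> 935175449</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Select by class: class_ has a trailing underscore because "class" is a Python keyword.
info_paragraphs = soup.find_all("p", class_="ey_info")

# Select by id: ids should be unique, so find() (first match only) is enough.
results_div = soup.find(id="results")

# get_text(" ", strip=True) joins nested text pieces with a single space.
info_texts = [p.get_text(" ", strip=True) for p in info_paragraphs]
print(info_texts)
```

Selecting by class returns every matching tag, so the third paragraph (which has no class) is left out.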

See the class:

<div class="col-md-4">
  <div class="vp ey_block block-is-flex">
    <div class="ey_inner_block">
      <h4 class="text-center"><a href="/youth/volunteering/organisation/935175449_en" target="_blank">&quot;People need people&quot; Zaporizhya oblast civic organisation of disabled families</a></h4>
      <p class="ey_info">
        <i class="fa fa-location-arrow fa-lg"></i>
        Zaporizhzhya, <strong>Ukraine</strong>
      </p>
      <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
      <p><strong>PIC no:</strong> 935175449</p>
      <div class="empty-block">
        <a href="/youth/volunteering/organisation/935175449_en" target="_blank" class="ey_btn btn btn-default pull-right">Read more</a>
      </div>
    </div>
  </div>
</div>

So this leads to:

# import libraries
import requests
from bs4 import BeautifulSoup

# download the page and parse the HTML
page = requests.get("https://europa.eu/youth/volunteering/evs-organisation_en")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())  # inspect the parsed document

Now we can use the find_all method to search for items by class or by id. In the example below, we search for every element that has the class col-md-4, which wraps the organisation block in the excerpt above:

<div class="col-md-4">

So we choose:

soup.find_all(class_="col-md-4")

Now I have to combine all of this.
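One way the pieces could be combined is sketched below: loop over every col-md-4 block and pull the individual fields out of it. It parses the excerpt from above as an inline string rather than the live page, so it assumes every result block on the real page has the same layout. BASE_URL, parse_card and the dictionary keys are names made up for this sketch.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://europa.eu"  # assumed base for the relative links

# One result block, copied from the page excerpt.
SAMPLE_HTML = """
<div class="col-md-4">
  <div class="vp ey_block block-is-flex">
    <div class="ey_inner_block">
      <h4 class="text-center"><a href="/youth/volunteering/organisation/935175449_en" target="_blank">&quot;People need people&quot; Zaporizhya oblast civic organisation of disabled families</a></h4>
      <p class="ey_info">
        <i class="fa fa-location-arrow fa-lg"></i>
        Zaporizhzhya, <strong>Ukraine</strong>
      </p>
      <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
      <p><strong>PIC no:</strong> 935175449</p>
      <div class="empty-block">
        <a href="/youth/volunteering/organisation/935175449_en" target="_blank" class="ey_btn btn btn-default pull-right">Read more</a>
      </div>
    </div>
  </div>
</div>
"""


def parse_card(card):
    """Turn one col-md-4 block into a flat dict (hypothetical field names)."""
    link = card.find("h4").find("a")
    info = card.find_all("p", class_="ey_info")

    # The PIC number sits in a plain <p> whose text starts with "PIC no:".
    pic = ""
    for p in card.find_all("p"):
        text = p.get_text(" ", strip=True)
        if text.startswith("PIC no:"):
            pic = text[len("PIC no:"):].strip()

    return {
        "name": link.get_text(strip=True),
        "url": urljoin(BASE_URL, link["href"]),
        "location": info[0].get_text(" ", strip=True),
        "country": info[0].find("strong").get_text(strip=True),
        "role": info[1].get_text(strip=True),
        "pic": pic,
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
organisations = [parse_card(card) for card in soup.find_all("div", class_="col-md-4")]
print(organisations)
```

Against the live page, the soup would come from the requests call instead of SAMPLE_HTML. Blocks that lack one of these tags would need extra checks, since find returns None when nothing matches.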

Update: my approach so far:

I have extracted data wrapped in multiple HTML tags from a webpage using BeautifulSoup4, and I want to store all of it in a list. To be more concrete: I want each extracted value as a separate list element, separated by commas (i.e. CSV-formatted).
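For the CSV part, a hedged sketch using the standard csv module is shown below. It is preferable to joining values with commas by hand, because the organisation name itself contains quotation marks and such values need escaping. The field names and the records_to_csv helper are made up for illustration; the values are taken from the page excerpt.

```python
import csv
import io

FIELDNAMES = ["name", "city", "country", "role", "pic"]  # hypothetical column names

# One record with the values from the page excerpt.
records = [
    {
        "name": '"People need people" Zaporizhya oblast civic organisation of disabled families',
        "city": "Zaporizhzhya",
        "country": "Ukraine",
        "role": "Sending",
        "pic": "935175449",
    },
]


def records_to_csv(records, fieldnames):
    """Serialise a list of dicts to CSV text, with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # quoting and escaping are handled by the csv module
    return buffer.getvalue()


csv_text = records_to_csv(records, FIELDNAMES)
print(csv_text)
```

To write a file instead of a string, the same DictWriter can be given a file opened with open(path, "w", newline="", encoding="utf-8").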

To start from the beginning, here is the structure of the HTML content:

 <div class="view-content">
            <div class="row is-flex"></span>
                 <div class="col-md-4"></span>
            <div class </span>
  <div class= >
    <h4 Data 1 </span>
          <div class= Data 2</span>
            <p class=
    <i class=
     <strong>Data 3 </span>
</p>    <p class= Data 4 </span>
          <p class= Data 5 </span>
                  <p><strong>Data 6</span>
        <div class=</span>
      <a href="Data 7</span>
  </div>
</div>

Code to extract:

for data in elem.find_all('span', class_=""):
    print(data.text)  # elem: a parent tag selected earlier

This should give output like the following:

data = [ele.text for ele in soup.find_all('span', {'class':'NormalTextrun'})]
print(data)

Expected output: [' Data 1 ', ' Data 2 ', ' Data 3 ', and so forth]

Question: I need help with the extraction part.

malaga asked Nov 08 '22 11:11

1 Answer

Try this:

# find_all(string=True) returns every text node (strings, not tags); keep the non-empty ones
data = [str(s) for s in soup.find_all(string=True) if s.strip()]
print(data)
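Collecting every text node in the whole document also picks up menus, scripts and other page chrome. A possible refinement, sketched here under the assumption that each organisation sits in its own col-md-4 block, is to use stripped_strings per block so that every block yields its own list. The sample markup is a shortened, partly hypothetical version of the page excerpt.

```python
from bs4 import BeautifulSoup

# Shortened sample: the first block follows the excerpt, the second is hypothetical.
SAMPLE_HTML = """
<div class="col-md-4">
  <h4 class="text-center"><a href="/youth/volunteering/organisation/935175449_en">Example organisation</a></h4>
  <p class="ey_info">Zaporizhzhya, <strong>Ukraine</strong></p>
  <p class="ey_info">Sending</p>
  <p><strong>PIC no:</strong> 935175449</p>
</div>
<div class="col-md-4">
  <h4 class="text-center"><a href="#">Another organisation</a></h4>
  <p class="ey_info">Receiving</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed,
# skipping nodes that are empty after stripping.
rows = [list(card.stripped_strings) for card in soup.find_all("div", class_="col-md-4")]

for row in rows:
    print(row)
```

Each inner list then corresponds to one organisation, which maps directly onto one CSV row.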
Sushant Kathuria answered Nov 14 '22 21:11