Python - Easiest way to scrape text from list of URLs using BeautifulSoup

What's the easiest way to scrape just the text from a handful of webpages (using a list of URLs) using BeautifulSoup? Is it even possible?

Best, Georgina

Georgina asked Mar 16 '11 20:03


1 Answer

import urllib2
import BeautifulSoup
import re

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urllib2.urlopen(url).read()
    # parse as html structured document
    bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # kill javascript content
    for s in bs.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = bs.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    # one block of extracted text per URL, in the same order as urls
    txt = [getPageText(url) for url in urls]

if __name__ == "__main__":
    main()

This version removes JavaScript (the contents of script tags) and decodes HTML entities. It is written for Python 2 (urllib2) and BeautifulSoup 3.
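For readers on Python 3, here is a rough port of the same idea to the bs4 package (BeautifulSoup 4). Treat it as a sketch, not the answerer's code: the names html_to_text and get_page_text are made up for illustration, the built-in 'html.parser' backend is an assumption, and stripping style tags is an addition to the original logic. bs4 decodes HTML entities while parsing, so the convertEntities argument is not needed.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# a line break followed by any run of whitespace (blank lines, indentation)
NEWLINES = re.compile(r'[\r\n]\s+')


def html_to_text(html):
    """Return the body text of an HTML document (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # drop script and style elements so their source does not end up in the text
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    # fall back to the whole document if the page has no <body>
    root = soup.body if soup.body is not None else soup
    text = root.get_text('\n')
    # collapse line breaks plus trailing whitespace into single newlines
    return NEWLINES.sub('\n', text).strip()


def get_page_text(url):
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url) as response:
        return html_to_text(response.read())
```

With a list of URLs, the loop is the same as in the answer: texts = [get_page_text(url) for url in urls]. Keeping the parsing in html_to_text, separate from the download, makes it possible to try the text extraction on a saved HTML string without any network access.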

Hugh Bothwell answered Oct 05 '22 22:10