Python - Easiest way to scrape text from list of URLs using BeautifulSoup

1 Answer

import urllib2
import BeautifulSoup
import re

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urllib2.urlopen(url).read()
    # parse as html structured document
    bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # kill javascript content
    for s in bs.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = bs.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]

if __name__=="__main__":
    main()

It now removes JavaScript and decodes HTML entities. Note that the code targets Python 2 (urllib2) and BeautifulSoup 3 (the convertEntities argument).
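
On Python 3, urllib2 is gone and BeautifulSoup ships as the bs4 package, so the same steps could look roughly like the sketch below. The names html_to_text, get_page_text, and NEWLINES, and the timeout parameter, are my own choices and not part of the original answer. Dropping style tags is also an addition. bs4 decodes HTML entities on its own, so no convertEntities argument is needed.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# A line break followed by any run of whitespace collapses to one newline.
NEWLINES = re.compile(r'[\r\n]\s+')


def html_to_text(html):
    """Return the text of the <body> of an HTML document (str or bytes)."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove JavaScript (and, as an addition to the original, CSS).
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    # Assumption: fall back to the whole document when there is no <body>.
    root = soup.body if soup.body is not None else soup
    text = root.get_text('\n')
    return NEWLINES.sub('\n', text).strip()


def get_page_text(url, timeout=10):
    """Fetch a URL and return its body text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return html_to_text(response.read())
```

With a list of URLs as in main() above, `[get_page_text(url) for url in urls]` gives one string per page. Keeping the parsing in html_to_text means it can be tried on a literal HTML string without touching the network.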

116

answered Oct 05 '22 22:10

Hugh Bothwell


Tags: python, beautifulsoup, web-scraping, screen-scraping

Georgina
