Formatting text output with Scrapy in Python

Question

I'm trying to scrape pages using a Scrapy spider and then save those pages into a .txt file in a readable form. The code I'm using to do this is:

def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url) 

        hxs = HtmlXPathSelector(response)

        title = hxs.select('/html/head/title/text()').extract() 
        content = hxs.select('//*[@id="content"]').extract() 

        texts = "%s\n\n%s" % (title, content) 

        soup = BeautifulSoup(''.join(texts)) 

        strip = ''.join(BeautifulSoup(pretty).findAll(text=True)) 

        filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title) 
        filly = open(filename, "w")
        filly.write(strip)

I've combined BeautifulSoup here because the body text contains a lot of HTML that I don't want in the final product (primarily links), so I use BS to strip out the HTML and leave only the text that is of interest.

This gives me output that looks like:

[u"School, Chandler's Ford (Hansard, 30 November 1961)"]

[u'

 
      


  HC Deb 30 November 1961 vol 650 cc608-9

 


  608

 


  



  


   


    \xa7

   


    28.

   



     Dr. King


   


    
            asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.


   


  


 
      


  



  


   


    \xa7

   



     Sir D. Eccles


   


    
            I understand that the authority has paid \xa375,000 for this site.

While I want the output to look like:

    School, Chandler's Ford (Hansard, 30 November 1961)

          HC Deb 30 November 1961 vol 650 cc608-9

          608

            28.

Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.

Sir D. Eccles I understand that the authority has paid £75,000 for this site.

So I'm basically looking for a way to remove the newline indicators, tighten everything up, and convert any special characters to their normal form.

reclosedev · Accepted Answer

My answer is in the comments in the code:

import re
import codecs

# ...
# ...
# extract() returns a list, so you need to take the first element
title = hxs.select('/html/head/title/text()').extract()[0]
content = hxs.select('//*[@id="content"]')
# instead of using BeautifulSoup for this task, you can use the following
content = content.select('string()').extract()[0]

# simply collapse repeated spaces and newlines; you may need to adjust this expression
cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE | re.UNICODE)

texts = "%s\n\n%s" % (title, cleaned_content)

# looks like there is a typo in the filename creation
# filename ....

# and my preferred way to write a file with an explicit encoding
with codecs.open(filename, 'w', encoding='utf-8') as output:
    output.write(texts)
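
The snippet above is Python 2 (the `ur'...'` prefix is a syntax error in Python 3). As a rough sketch of the same two ideas on Python 3, here is a version that runs without Scrapy. The sample string, the helper name `collapse_whitespace` and the temporary output path are all made up for illustration. The list `extracted` only stands in for what `extract()` returns.

```python
import os
import re
import tempfile


def collapse_whitespace(text):
    """Replace every run of two or more whitespace characters with its first character."""
    # Same pattern as in the answer. In Python 3, str patterns are
    # Unicode-aware by default, so re.UNICODE is not needed.
    return re.sub(r'(\s)\s+', r'\1', text)


# Hypothetical stand-in for the list of strings that extract() returns.
extracted = ["\n\n  HC Deb 30 November 1961\n\n \n\n  608\n\n    \xa7\n\n    28.\n"]

# Formatting the list itself embeds its repr: brackets, quotes and
# literal backslash-n escapes, much like the output in the question.
as_list = "%s" % (extracted,)

# Taking the first element gives the text itself, which can then be cleaned.
content = extracted[0]
cleaned = collapse_whitespace(content)

# Write with an explicit encoding so non-ASCII characters are stored as UTF-8.
out_path = os.path.join(tempfile.mkdtemp(), "example.txt")  # illustrative path
with open(out_path, "w", encoding="utf-8") as output:
    output.write(cleaned)
```

Because the pattern keeps the first whitespace character of each run, a run that starts with a newline collapses to a single newline and a run that starts with a space collapses to a single space. If you would rather have everything on one line, replacing `r'\s+'` with a single space is one possible adjustment.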

Tags: python, text, web-scraping, scrapy
