I'm trying to scrape pages using a Scrapy spider and then save those pages into a .txt file in a readable form. The code I'm using to do this is:
def parse_item(self, response):
self.log('Hi, this is an item page! %s' % response.url)
hxs = HtmlXPathSelector(response)
title = hxs.select('/html/head/title/text()').extract()
content = hxs.select('//*[@id="content"]').extract()
texts = "%s\n\n%s" % (title, content)
soup = BeautifulSoup(''.join(texts))
strip = ''.join(BeautifulSoup(pretty).findAll(text=True))
filename = ("/Users/username/path/output/Hansard-" + '%s'".txt") % (title)
filly = open(filename, "w")
filly.write(strip)
I've combined BeautifulSoup here because the body text contains a lot of HTML that I don't want in the final product (primarily links), so I use BS to strip out the HTML and leave only the text that is of interest.
This gives me output that looks like
[u"School, Chandler's Ford (Hansard, 30 November 1961)"]
[u'
\n \n
HC Deb 30 November 1961 vol 650 cc608-9
\n
608
\n
\n
\n
\n
\xa7
\n
28.
\n
Dr. King
\n
\n asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.\n
\n
\n
\n \n
\n
\n
\n
\xa7
\n
Sir D. Eccles
\n
\n I understand that the authority has paid \xa375,000 for this site.\n \n
While I want the output to look like:
School, Chandler's Ford (Hansard, 30 November 1961)
HC Deb 30 November 1961 vol 650 cc608-9
608
28.
Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.
Sir D. Eccles I understand that the authority has paid £375,000 for this site.
So I'm basically looking for how to remove the newline indicators \n
, tighten everything up, and convert any special characters to their normal format.
My answer in comments for code:
import re
import codecs
#...
#...
#extract() returns list, so you need to take first element
title = hxs.select('/html/head/title/text()').extract() [0]
content = hxs.select('//*[@id="content"]')
#instead of using BeautifulSoup for this task, you can use folowing
content = content.select('string()').extract()[0]
#simply delete duplicating spaces and newlines, maybe you need to adjust this expression
cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE + re.UNICODE)
texts = "%s\n\n%s" % (title, cleaned_content)
#look's like typo in filename creation
#filename ....
#and my preferable way to write file with encoding
with codecs.open(filename, 'w', encoding='utf-8') as output:
output.write(texts)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With