I want to write to a CSV file in Scrapy:
for rss in rsslinks:
    item = AppleItem()
    item['reference_link'] = response.url
    base_url = get_base_url(response)
    item['rss_link'] = urljoin_rfc(base_url,rss)
    #item['rss_link'] = rss
    items.append(item)
    #items.append("\n")
f = open(filename,'a+') #filename is apple.com.csv
for item in items:
    f.write("%s\n" % item)
My output is this:
{'reference_link': 'http://www.apple.com/',
 'rss_link': 'http://www.apple.com/rss/'}
{'reference_link': 'http://www.apple.com/rss/',
 'rss_link': 'http://ax.itunes.apple.com/WebObjects/MZStore.woa/wpa/MRSS/newreleases/limit=10/rss.xml'}
{'reference_link': 'http://www.apple.com/rss/',
 'rss_link': 'http://ax.itunes.apple.com/WebObjects/MZStore.woa/wpa/MRSS/newreleases/limit=25/rss.xml'}
What I want is this format:
reference_link rss_link
http://www.apple.com/ http://www.apple.com/rss/
Saving CSV Files Via The Command Line
The first and simplest way to create a CSV file of the data you have scraped is to define an output path when starting your spider from the command line. To save to a CSV file, add the -o flag to the scrapy crawl command, followed by the path of the file you want to write.
In the callback function, you parse the response (the web page) and return Item objects, Request objects, or an iterable of both. Those Requests also carry a callback (possibly the same one); Scrapy downloads them and passes each response to the specified callback.
The key to running Scrapy from a Python script is the CrawlerProcess class, which is defined in the scrapy.crawler module. It provides the engine for running Scrapy inside a script: it starts the Twisted reactor that Scrapy is built on, so you do not have to manage the reactor yourself.
Simply crawl with the -o option, like:
scrapy crawl <spider name> -o file.csv -t csv
This is what worked for me using Python3:
scrapy runspider spidername.py -o file.csv -t csv
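The same export can also be configured in the project settings instead of on the command line. The fragment below is a minimal sketch, assuming Scrapy 2.1 or later (where the FEEDS setting is available) and a project-level settings.py. The file name comes from the commands above and the field names from the question; the "fields" entry fixes the column order of the header row.

```python
# settings.py (fragment); assumes Scrapy 2.1+, where the FEEDS setting is available
FEEDS = {
    "file.csv": {
        "format": "csv",
        # Column order of the CSV header; names taken from the question's item
        "fields": ["reference_link", "rss_link"],
    },
}
```

With this in place, running scrapy crawl <spider name> without -o should write the items to file.csv.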
The best approach to this problem is to use Python's built-in csv module.
import csv

fieldnames = ['reference_link', 'rss_link']  # column names, written as the header row
# Output_file.csv is the name of the output file; newline='' (Python 3) avoids blank lines on Windows
with open('Output_file.csv', 'w', newline='') as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()
    base_url = get_base_url(response)  # same for every link, so compute it once
    for rss in rsslinks:
        # write one row per RSS link found on the page
        writer.writerow({'reference_link': response.url, 'rss_link': urljoin_rfc(base_url, rss)})
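To try the csv.DictWriter approach outside a spider, here is a small self-contained sketch. It makes a few assumptions that are not in the original answer: the helper name write_rss_rows is hypothetical, the sample URL and link stand in for response.url and rsslinks, and the standard library's urllib.parse.urljoin is swapped in for urljoin_rfc, which may not be available in your Scrapy version.

```python
import csv
import io
from urllib.parse import urljoin


def write_rss_rows(out_file, page_url, rss_links):
    """Write a header plus one CSV row per RSS link; return the row count.

    Hypothetical helper for illustration, not part of Scrapy.
    """
    writer = csv.DictWriter(out_file, fieldnames=["reference_link", "rss_link"])
    writer.writeheader()
    count = 0
    for rss in rss_links:
        # urljoin resolves a relative link against the page URL
        writer.writerow({"reference_link": page_url, "rss_link": urljoin(page_url, rss)})
        count += 1
    return count


# Sample data standing in for response.url and rsslinks (assumed values)
buffer = io.StringIO()
rows_written = write_rss_rows(buffer, "http://www.apple.com/", ["/rss/"])
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file opened with open('Output_file.csv', 'w', newline='') instead of the StringIO buffer would write the same rows to disk.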