Trying to get my head around Scrapy but hitting a few dead ends.
I have two tables on a page and would like to extract the data from each one, then move along to the next page.
The tables look like this (the first is called Y1, the second Y2) and their structures are the same.
<div id="Y1" style="margin-bottom: 0px; margin-top: 15px;">
<h2>First information</h2><hr style="margin-top: 5px; margin-bottom: 10px;">
<table class="table table-striped table-hover table-curved">
<thead>
<tr>
<th class="tCol1" style="padding: 10px;">First Col Head</th>
<th class="tCol2" style="padding: 10px;">Second Col Head</th>
<th class="tCol3" style="padding: 10px;">Third Col Head</th>
</tr>
</thead>
<tbody>
<tr>
<td>Info 1</td>
<td>Monday 5 September, 2016</td>
<td>Friday 21 October, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 2</b></td>
<td class="dtstart" timestamp="1477094400"><b></b></td>
<td class="dtend" timestamp="1477785600">
<b>Sunday 30 October, 2016</b></td>
</tr>
<tr>
<td>Info 3</td>
<td>Monday 31 October, 2016</td>
<td>Tuesday 20 December, 2016</td>
</tr>
<tr class="vevent">
<td class="summary"><b>Info 4</b></td>
<td class="dtstart" timestamp="1482278400"><b>Wednesday 21 December, 2016</b></td>
<td class="dtend" timestamp="1483315200">
<b>Monday 2 January, 2017</b></td>
</tr>
</tbody>
</table>
As you can see, the structure is a little inconsistent, but as long as I can get each td and output it to CSV I'll be a happy guy.
I tried using XPath but this only confused me more.
My last attempt:
import scrapy
from SchoolDates_1.items import Schooldates1Item

class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = (
        'https://mysite.co.uk/page1/',
    )

    def parse_products(self, response):
        products = response.xpath('//*[@id="Y1"]/table')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[1]').extract()[0]
            item['first'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[2]').extract()[0]
            item['last'] = product.xpath('//*[@id="Y1"]/table/tbody/tr[1]/td[3]').extract()[0]
            yield item
No errors here, but it just fires back lots of information about the crawl and no actual results.
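In hindsight, the main problem is that Scrapy never calls parse_products: for requests generated from start_urls the default callback is parse, so the pages get crawled but no extraction code runs. A minimal sketch of the rename (everything else unchanged):
import scrapy

class myScraperSpider(scrapy.Spider):
    name = "myScraper"
    allowed_domains = ["mysite.co.uk"]
    start_urls = ('https://mysite.co.uk/page1/',)

    # Scrapy routes responses from start_urls to parse() by default; a
    # method named parse_products is never invoked unless it is passed
    # explicitly, e.g. scrapy.Request(url, callback=self.parse_products).
    def parse(self, response):
        ...  # same extraction logic as above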
Update:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = (
        'https://termdates.co.uk/school-holidays-16-19-abingdon/',
    )

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
This gives me: IndentationError: unexpected indent
If I run the amended script below (thanks to @Granitosaurus) with CSV output (-o schoolDates.csv), I get an empty file. In hindsight the callback is still named parse_products, so Scrapy never calls it, which is also why the undefined sel never raises a NameError:
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse_products(self, response):
        products = sel.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[1]/text()').extract_first()
            item['last'] = p.xpath('td[1]/text()').extract_first()
            yield item
The log showed the usual crawl stats but no scraped items.
Update 2: (skips rows) This pushes results to the CSV file but skips every other row.
The shell shows {'hol': None, 'last': u'\r\n\t\t\t\t\t\t\t\t', 'first': None}
import scrapy

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        for p in products[1:]:
            item = dict()
            item['hol'] = p.xpath('td[1]/text()').extract_first()
            item['first'] = p.xpath('td[2]/text()').extract_first()
            item['last'] = p.xpath('td[3]/text()').extract_first()
            yield item
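The skipped rows are the class="vevent" ones: their values are wrapped in <b>, so td[n]/text() finds no direct text node and returns None or bare whitespace, which is exactly what the shell output above shows. Descending with //text() fixes it. A small sketch with parsel (the selector library Scrapy uses), using one row from the sample table:
from parsel import Selector

row = Selector(text='<table><tr class="vevent"><td class="summary"><b>Info 2</b></td></tr></table>')
# /text() only matches text nodes that are direct children of the td,
# and here the td's only child is the <b> element:
print(row.xpath('//td[1]/text()').extract_first())   # -> None
# //text() matches text nodes anywhere under the td, including inside <b>:
print(row.xpath('//td[1]//text()').extract_first())  # -> 'Info 2'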
Solution: Thanks to @vold. This crawls all pages in start_urls and handles the inconsistent table layout.
# -*- coding: utf-8 -*-
import scrapy
from SchoolDates_1.items import Schooldates1Item

class SchoolSpider(scrapy.Spider):
    name = "school"
    allowed_domains = ["termdates.co.uk"]
    start_urls = ('https://termdates.co.uk/school-holidays-16-19-abingdon/',
                  'https://termdates.co.uk/school-holidays-3-dimensions',)

    def parse(self, response):
        products = response.xpath('//*[@id="Year1"]/table//tr')
        # ignore the table header row
        for product in products[1:]:
            item = Schooldates1Item()
            item['hol'] = product.xpath('td[1]//text()').extract_first()
            item['first'] = product.xpath('td[2]//text()').extract_first()
            item['last'] = ''.join(product.xpath('td[3]//text()').extract()).strip()
            item['url'] = response.url
            yield item
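For reference, run against the sample table at the top (illustrative only; the real page differs), the Info 2 row would come out as:
{'hol': 'Info 2', 'first': None, 'last': 'Sunday 30 October, 2016', 'url': 'https://termdates.co.uk/school-holidays-16-19-abingdon/'}
which also shows why 'last' needs the ''.join(...).strip(): td[3] carries a stray newline before its <b> element, and 'first' comes back None because that cell's <b> is empty.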
You then run your spider using the runspider command, passing the -o argument to tell Scrapy where to put the extracted data. Scrapy will create an output.json file in the directory where you run the spider and export your extracted data into it in JSON format.
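For example, assuming the spider above is saved as school_spider.py:
scrapy runspider school_spider.py -o output.json
# or, for the CSV output used earlier:
scrapy runspider school_spider.py -o schoolDates.csv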
You can use CSS selectors instead of XPaths; I always find CSS selectors easier.
def parse_products(self, response):
    # iterate over the table rows, skipping the header row
    for product in response.css('#Y1 table tr')[1:]:
        item = Schooldates1Item()
        item['hol'] = product.css('td:nth-child(1)::text').extract_first()
        item['first'] = product.css('td:nth-child(2)::text').extract_first()
        item['last'] = product.css('td:nth-child(3)::text').extract_first()
        yield item
Also, do not use the tbody tag in your selectors. Source: the Scrapy docs:
"Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions."
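For example, a quick sketch with parsel against a source that has no <tbody> (a browser's dev tools would still show one):
from parsel import Selector

html = '<div id="Y1"><table><tr><td>Info 1</td></tr></table></div>'  # raw source, no <tbody>
sel = Selector(text=html)

sel.xpath('//*[@id="Y1"]/table/tbody/tr')  # [] -- brittle, relies on <tbody> being in the source
sel.xpath('//*[@id="Y1"]/table//tr')       # matches the row whether or not <tbody> is present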
You need to slightly correct your code. Since you already select all the row elements within the table, you don't need to point at the table again. Thus you can shorten your XPath to something like td[1]//text().
def parse_products(self, response):
    products = response.xpath('//*[@id="Year1"]/table//tr')
    # ignore the table header row
    for product in products[1:]:
        item = Schooldates1Item()
        item['hol'] = product.xpath('td[1]//text()').extract_first()
        item['first'] = product.xpath('td[2]//text()').extract_first()
        item['last'] = product.xpath('td[3]//text()').extract_first()
        yield item
Edited my answer since @stutray provided the link to the site.