I want to write my own custom Scrapy link extractor.
The Scrapy documentation says it has two built-in extractors:
http://doc.scrapy.org/en/latest/topics/link-extractors.html
But I haven't seen any code example of how to implement a custom link extractor. Can someone give an example of writing one?
A link extractor is an object that extracts links from responses. The __init__ method of LxmlLinkExtractor takes settings that determine which links may be extracted. LxmlLinkExtractor.extract_links returns a list of matching Link objects from a Response object.
In Scrapy, there are built-in extractors such as LinkExtractor (from scrapy.linkextractors import LinkExtractor). You can write your own link extractor by implementing a simple interface. Every link extractor has a public method called extract_links, which takes a Response object and returns a list of scrapy.link.Link objects.
You can instantiate a link extractor once and call its extract_links method several times to extract links from different responses.
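To make that interface concrete, here is a minimal sketch of an object with an extract_links method. It is an illustration under assumptions, not Scrapy's own code: SimpleLinkExtractor and _AnchorCollector are made-up names, Link is a namedtuple standing in for scrapy.link.Link so that the snippet runs without Scrapy installed, and the response is assumed to expose url and text attributes. It is written for Python 3.

```python
from collections import namedtuple
from html.parser import HTMLParser
from urllib.parse import urljoin

# Stand-in for scrapy.link.Link (assumption: only url and text are needed here).
Link = namedtuple("Link", ["url", "text"])


class _AnchorCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) pairs for <a> elements."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.found.append((self._href, "".join(self._text).strip()))
            self._href = None


class SimpleLinkExtractor:
    """Hypothetical extractor exposing extract_links(response)."""

    def extract_links(self, response):
        collector = _AnchorCollector()
        collector.feed(response.text)
        collector.close()
        seen, links = set(), []
        for href, text in collector.found:
            # Resolve relative hrefs against the page URL and drop duplicates.
            url = urljoin(response.url, href.strip())
            if url not in seen:
                seen.add(url)
                links.append(Link(url, text))
        return links
```

Because the extractor keeps no per-response state, one instance can be reused across many responses. In a real spider you would import Link from scrapy.link instead of using the stand-in.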
Link extractors are used in CrawlSpider spiders through a set of Rule objects.
Here is an example of a custom link extractor:
    # Note: SgmlLinkExtractor, Link, urljoin, linkre, clean_link, remove_entities,
    # remove_tags and replace_escape_chars must be imported or defined elsewhere
    # in the module; they are not shown in this snippet.
    class RCP_RegexLinkExtractor(SgmlLinkExtractor):
        """High performant link extractor"""

        def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
            if base_url is None:
                base_url = urljoin(response_url, self.base_url) if self.base_url else response_url
            clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
            clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()
            links_text = linkre.findall(response_text)
            # A set removes duplicate (url, text) pairs
            urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])
            return [Link(url, text) for url, text in urlstext]
Usage
    rules = (
        Rule(
            RCP_RegexLinkExtractor(
                allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
                # Regex explanation:
                #     [a-z]{2} - matches a two character state abbreviation
                #     [a-z]+   - matches a state name
                #     [0-9]{4} - matches a 4-digit unique webpage identifier
                allow_domains=('realclearpolitics.com',),
            ),
            callback='parseStatePolls',
            # follow=None, # default
            process_links='processLinks',
            process_request='processRequest',
        ),
    )
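The allow pattern can be checked on its own with the re module before it goes into a Rule. The sketch below assumes that the pattern is applied as a regex search against the absolute URL, which is my understanding of how Scrapy treats allow. The two URLs are made up for illustration and is_allowed is a hypothetical helper.

```python
import re

# Same pattern as in the rule above.
ALLOW = re.compile(
    r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"
)


def is_allowed(url):
    """Return True if the pattern is found anywhere in the URL."""
    return ALLOW.search(url) is not None


# Hypothetical URLs, not taken from the real site.
good = "http://www.realclearpolitics.com/epolls/2012/president/oh/ohio_romney_vs_obama-1860.html"
bad = "http://www.realclearpolitics.com/epolls/2012/president/ohio/ohio_romney_vs_obama-1860.html"

print(is_allowed(good))
print(is_allowed(bad))
```

The second URL is rejected because [a-z]{2} must be followed directly by a slash, so a four-letter path segment does not fit.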
Have a look here: https://github.com/jtfairbank/RCP-Poll-Scraper
I had a hard time finding recent examples for this, so I decided to post a walkthrough of how I wrote a custom link extractor.
I had a problem crawling a website whose href URLs contained spaces, tabs and line breaks, like this:
<a href="
/something/something.html
" />
Supposing the page that had this link was at:
http://example.com/something/page.html
Instead of transforming this href url into:
http://example.com/something/something.html
Scrapy transformed it into:
http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20
This caused an infinite loop, as the crawler went deeper and deeper into those badly interpreted URLs.
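The normalization this situation calls for can be sketched with the standard library alone: strip the whitespace around the href value before resolving it against the page URL. This is a Python 3 illustration, and join_href is a hypothetical helper rather than anything provided by Scrapy. How urljoin treats an unstripped value may vary between Python versions, so only the stripped case is shown.

```python
from urllib.parse import urljoin


def join_href(page_url, href):
    """Resolve href against page_url, ignoring whitespace around the href value."""
    return urljoin(page_url, href.strip())


page = "http://example.com/something/page.html"
# The href value as it appears in the markup, line breaks and spaces included.
href = "\n        /something/something.html\n        "

fixed = join_href(page, href)
print(fixed)
```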
I tried to use the process_value and process_links params of LxmlLinkExtractor, as suggested here, without luck, so I decided to patch the method that processes relative URLs.
At the current version of Scrapy (1.0.3), the recommended link extractor is the LxmlLinkExtractor.
If you want to extend LxmlLinkExtractor, you should check how the code looks in the Scrapy version that you are using.
You can probably open the location of the Scrapy code you are currently using by running, from the command line (on OS X):
open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')
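If the shell one-liner does not work on your system (it uses the Python 2 print statement and the OS X open command), a portable way to find the same directory is to ask Python where it would import the package from. This is a small Python 3 sketch, and package_dir is a hypothetical helper name.

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory of an importable top-level package, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, origin points at its __init__.py file.
    return os.path.dirname(spec.origin)


# Prints the Scrapy directory, or None when Scrapy is not installed.
print(package_dir("scrapy"))
```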
In the version that I use (1.0.3), the code of LxmlLinkExtractor is in:
scrapy/linkextractors/lxmlhtml.py
There I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.
So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The single line I modified is marked with a comment.
    # Import everything from the original lxmlhtml
    from scrapy.linkextractors.lxmlhtml import *

    _collect_string_content = etree.XPath("string()")

    # Extend LxmlParserLinkExtractor
    class CustomParserLinkExtractor(LxmlParserLinkExtractor):

        def _extract_links(self, selector, response_url, response_encoding, base_url):
            links = []
            for el, attr, attr_val in self._iter_links(selector._root):
                # Original method was:
                # attr_val = urljoin(base_url, attr_val)
                # So I just added a .strip()
                attr_val = urljoin(base_url, attr_val.strip())
                url = self.process_attr(attr_val)
                if url is None:
                    continue
                if isinstance(url, unicode):
                    url = url.encode(response_encoding)
                # to fix relative links after process_value
                url = urljoin(response_url, url)
                link = Link(url, _collect_string_content(el) or u'',
                            nofollow=True if el.get('rel') == 'nofollow' else False)
                links.append(link)
            return unique_list(links, key=lambda link: link.url) \
                if self.unique else links

    # Extend LxmlLinkExtractor
    class CustomLinkExtractor(LxmlLinkExtractor):

        def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                     tags=('a', 'area'), attrs=('href',), canonicalize=True,
                     unique=True, process_value=None, deny_extensions=None, restrict_css=()):
            tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
            tag_func = lambda x: x in tags
            attr_func = lambda x: x in attrs
            # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
            lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
                                           unique=unique, process=process_value)
            # super(LxmlLinkExtractor, self) skips LxmlLinkExtractor.__init__ on purpose,
            # so that the custom parser above is the one that gets used
            super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
                allow_domains=allow_domains, deny_domains=deny_domains,
                restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
                canonicalize=canonicalize, deny_extensions=deny_extensions)
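The super(LxmlLinkExtractor, self) call in CustomLinkExtractor may look like a mistake, since the usual form would name CustomLinkExtractor. Naming the parent class instead starts the method lookup after that parent, so the parent's own __init__ (which presumably would build the stock parser) is skipped and the grandparent's __init__ receives the custom parser directly. The toy classes below are hypothetical stand-ins that only illustrate this mechanism in Python 3; they are not Scrapy classes.

```python
class Filtering:
    """Stand-in for the grandparent class that stores the parser."""

    def __init__(self, parser):
        self.parser = parser


class Stock(Filtering):
    """Stand-in for the parent class that always builds its own parser."""

    def __init__(self):
        super().__init__("stock parser")


class Custom(Stock):
    """Subclass that bypasses Stock.__init__ to inject a different parser."""

    def __init__(self):
        # Lookup starts after Stock in the MRO, so Filtering.__init__ runs directly.
        super(Stock, self).__init__("custom parser")


print(Stock().parser)
print(Custom().parser)
```

The price of this trick is that anything else the skipped __init__ does has to be repeated by hand, which is why the whole constructor body is copied in the answer above.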
And when defining the rules, I use CustomLinkExtractor:
    from scrapy.spiders import Rule

    rules = (
        # A raw string avoids invalid escape sequences; ':' and '/' need no escaping
        Rule(CustomLinkExtractor(canonicalize=False, allow=[r'^https?://example\.com/something/.*']),
             callback='parse_item', follow=True),
    )