 

How can I write my custom link extractor in Scrapy (Python)?

Tags:

python

scrapy

I want to write my own custom Scrapy link extractor for extracting links.

The Scrapy documentation says it has two built-in extractors:

http://doc.scrapy.org/en/latest/topics/link-extractors.html

But I haven't seen any code example of how I can implement a custom link extractor. Can someone give an example of writing a custom extractor?

Asked Dec 11 '12 by Mirage


2 Answers

Here is an example of a custom link extractor (it subclasses the old SgmlLinkExtractor):
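
The snippet relies on a few names it does not define itself (urljoin, the w3lib cleanup helpers, Link, linkre and clean_link). Below is a sketch of the imports and helper definitions it assumes; the module paths match old Scrapy versions on Python 2, which this snippet was written for, and the linkre pattern here is a simplified stand-in for the one used in the linked project:

import re
from urlparse import urljoin  # Python 2

from w3lib.html import remove_tags, remove_entities, replace_escape_chars

from scrapy.link import Link
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor  # scrapy.linkextractors.sgml in Scrapy 1.0+

# Simplified pattern yielding (href value, other attributes, anchor text) for each <a> tag
linkre = re.compile(r'<a\s+[^>]*?href="([^"]*)"([^>]*)>(.*?)</a>',
                    re.DOTALL | re.IGNORECASE)

def clean_link(link_text):
    """Strip whitespace and stray quotes from an href value."""
    return link_text.strip("\t\r\n '\"")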

class RCP_RegexLinkExtractor(SgmlLinkExtractor):
    """High-performance regex-based link extractor"""

    def _extract_links(self, response_text, response_url, response_encoding, base_url=None):
        if base_url is None:
            base_url = urljoin(response_url, self.base_url) if self.base_url else response_url

        # Resolve each href against the base URL and decode its HTML entities;
        # strip tags and escape characters from the anchor text
        clean_url = lambda u: urljoin(base_url, remove_entities(clean_link(u.decode(response_encoding))))
        clean_text = lambda t: replace_escape_chars(remove_tags(t.decode(response_encoding))).strip()

        # Find all (href, attributes, text) triples and deduplicate them
        links_text = linkre.findall(response_text)
        urlstext = set([(clean_url(url), clean_text(text)) for url, _, text in links_text])

        return [Link(url, text) for url, text in urlstext]

Usage

rules = (
    Rule(
        RCP_RegexLinkExtractor(
            allow=(r"epolls/2012/president/[a-z]{2}/[a-z]+_romney_vs_obama-[0-9]{4}\.html"),
            # Regex explanation:
            #     [a-z]{2} - matches a two character state abbreviation
            #     [a-z]+   - matches a state name
            #     [0-9]{4} - matches a 4 number unique webpage identifier

            allow_domains=('realclearpolitics.com',),
        ),
        callback='parseStatePolls',
        # follow=None, # default 
        process_links='processLinks',
        process_request='processRequest',
    ),
)
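
For context, these rules would normally live as a class attribute on a CrawlSpider. A minimal sketch of what that spider could look like (imports use Scrapy 0.x paths to match the code above; the spider name, start URL and callback bodies are illustrative):

from scrapy.contrib.spiders import CrawlSpider

class RCPPollSpider(CrawlSpider):
    name = 'rcp_polls'
    allowed_domains = ['realclearpolitics.com']
    start_urls = ['http://www.realclearpolitics.com/epolls/2012/president/']  # illustrative start URL

    rules = (
        # ... the Rule shown above goes here ...
    )

    def parseStatePolls(self, response):
        # parse the poll page and yield/return items here
        pass

    def processLinks(self, links):
        # optionally filter or fix the extracted links before requests are built
        return links

    def processRequest(self, request):
        # optionally modify each request built from a link
        return request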

Have a look at https://github.com/jtfairbank/RCP-Poll-Scraper for the full project.

Answered Oct 03 '22 by Mirage

I had a hard time finding recent examples for this, so I decided to post my walkthrough of the process of writing a custom link extractor.

Why I decided to create a custom link extractor

I had a problem crawling a website whose href URLs contained spaces, tabs and line breaks, like this:

<a href="
       /something/something.html
         " />

Supposing the page that contained this link was at:

http://example.com/something/page.html

Instead of resolving this href URL to:

http://example.com/something/something.html

Scrapy transformed it into:

http://example.com/something%0A%20%20%20%20%20%20%20/something/something.html%0A%20%20%20%20%20%20%20

And this was causing an infinite loop, as the crawler would go deeper and deeper into those badly interpreted URLs.

I tried to use the process_value and process_links parameters of LxmlLinkExtractor, as suggested here, without luck, so I decided to patch the method that processes relative URLs.
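
For reference, the process_value approach I tried looked roughly like this (a sketch; it did not solve the problem in my case, because the href has already been joined to the base URL by the time the value is processed, as the code further below shows):

from scrapy.linkextractors import LxmlLinkExtractor

# Strip whitespace from each extracted href value
link_extractor = LxmlLinkExtractor(process_value=lambda value: value.strip())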

Finding the original code

As of the Scrapy version current at the time of writing (1.0.3), the recommended link extractor is LxmlLinkExtractor.

If you want to extend LxmlLinkExtractor, you should check how the code is organized in the Scrapy version you are using.

You can open the location of your installed Scrapy code by running, from the command line (on OS X):

open $(python -c 'import site; print site.getsitepackages()[0] + "/scrapy"')
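
Alternatively, a cross-platform way to print the install location directly (Python 2 syntax, to match the command above):

python -c 'import os, scrapy; print os.path.dirname(scrapy.__file__)'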

In the version that I use (1.0.3) the code of LxmlLinkExtractor is in:

scrapy/linkextractors/lxmlhtml.py

There I saw that the method I needed to adapt was _extract_links() inside LxmlParserLinkExtractor, which is then used by LxmlLinkExtractor.

So I extended LxmlLinkExtractor and LxmlParserLinkExtractor with slightly modified classes called CustomLinkExtractor and CustomParserLinkExtractor. The single line I modified is shown next to a comment containing the original.

# Import everything from the original lxmlhtml
from scrapy.linkextractors.lxmlhtml import *
_collect_string_content = etree.XPath("string()")

# Extend LxmlParserLinkExtractor
class CustomParserLinkExtractor(LxmlParserLinkExtractor):

    def _extract_links(self, selector, response_url, response_encoding, base_url):
        links = []
        for el, attr, attr_val in self._iter_links(selector._root):

            # Original method was:
            # attr_val = urljoin(base_url, attr_val)
            # So I just added a .strip()

            attr_val = urljoin(base_url, attr_val.strip())

            url = self.process_attr(attr_val)
            if url is None:
                continue
            if isinstance(url, unicode):
                url = url.encode(response_encoding)
            # to fix relative links after process_value
            url = urljoin(response_url, url)
            link = Link(url, _collect_string_content(el) or u'',
                nofollow=True if el.get('rel') == 'nofollow' else False)
            links.append(link)

        return unique_list(links, key=lambda link: link.url) \
                if self.unique else links


# Extend LxmlLinkExtractor
class CustomLinkExtractor(LxmlLinkExtractor):

    def __init__(self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()):
        tags, attrs = set(arg_to_iter(tags)), set(arg_to_iter(attrs))
        tag_func = lambda x: x in tags
        attr_func = lambda x: x in attrs

        # Here I replaced the original LxmlParserLinkExtractor with my CustomParserLinkExtractor
        lx = CustomParserLinkExtractor(tag=tag_func, attr=attr_func,
            unique=unique, process=process_value)

        super(LxmlLinkExtractor, self).__init__(lx, allow=allow, deny=deny,
            allow_domains=allow_domains, deny_domains=deny_domains,
            restrict_xpaths=restrict_xpaths, restrict_css=restrict_css,
            canonicalize=canonicalize, deny_extensions=deny_extensions)
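
To illustrate why the .strip() matters, here is a quick comparison using the URLs from the example above (Python 2, urlparse.urljoin, which is what the extractor uses under the hood):

from urlparse import urljoin

base = 'http://example.com/something/page.html'
href = '\n       /something/something.html\n         '

print urljoin(base, href)          # whitespace survives the join and later gets percent-encoded by Scrapy
print urljoin(base, href.strip())  # http://example.com/something/something.html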

And when defining the rules, I use CustomLinkExtractor:

from scrapy.spiders import Rule


rules = (
    Rule(
        CustomLinkExtractor(
            canonicalize=False,
            allow=[r'^https?://example\.com/something/.*'],
        ),
        callback='parse_item',
        follow=True,
    ),
)
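
To sanity-check the extractor on a live page before running the whole crawl, you can try it from scrapy shell; the module path myproject.linkextractors below is hypothetical, adjust it to wherever you placed the class:

# inside `scrapy shell http://example.com/something/page.html`
from myproject.linkextractors import CustomLinkExtractor  # hypothetical module path

lx = CustomLinkExtractor(canonicalize=False)
for link in lx.extract_links(response):  # `response` is provided by scrapy shell
    print link.url
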
Answered Oct 03 '22 by Ivan Chaer